General-Purpose Register

Cortex-M3 Basics

Joseph Yiu, in The Definitive Guide to the ARM Cortex-M3 (Second Edition), 2010

3.1 Registers

As we've seen, the Cortex™-M3 processor has registers R0 through R15 and a number of special registers. R0 through R12 are general purpose, but some of the 16-bit Thumb® instructions can only access R0 through R7 (low registers), whereas 32-bit Thumb-2 instructions can access all these registers. Special registers have predefined functions and can only be accessed by special register access instructions.

3.1.1 General Purpose Registers R0 through R7

The R0 through R7 general purpose registers are also called low registers. They can be accessed by all 16-bit Thumb instructions and all 32-bit Thumb-2 instructions. They are all 32 bits; the reset value is unpredictable.

3.1.2 General Purpose Registers R8 through R12

The R8 through R12 registers are also called high registers. They are accessible by all Thumb-2 instructions but not by all 16-bit Thumb instructions. These registers are all 32 bits; the reset value is unpredictable (see Figure 3.1).

FIGURE 3.1. Registers in the Cortex-M3.

3.1.3 Stack Pointer R13

R13 is the stack pointer (SP). In the Cortex-M3 processor, there are two SPs. This duality allows two separate stack memories to be set up. When using the register name R13, you can only access the current SP; the other one is inaccessible unless you use the special instructions move to special register from general-purpose register (MSR) and move special register to general-purpose register (MRS). The two SPs are as follows:

Main Stack Pointer (MSP) or SP_main in ARM documentation: This is the default SP; it is used by the operating system (OS) kernel, exception handlers, and all application codes that require privileged access.

Process Stack Pointer (PSP) or SP_process in ARM documentation: This is used by the base-level application code (when not running an exception handler).

Stack PUSH and POP

Stack is a memory usage model. It is simply part of the system memory, and a pointer register (inside the processor) is used to make it work as a first-in/last-out buffer. The common use of a stack is to save register contents before some data processing and then restore those contents from the stack after the processing task is done.

FIGURE 3.2. Basic Concept of Stack Memory.

When doing PUSH and POP operations, the pointer register, commonly called the stack pointer, is adjusted automatically to prevent subsequent stack operations from corrupting previously stacked data. More details on stack operations are provided in a later part of this chapter.

It is not necessary to use both SPs. Simple applications can rely purely on the MSP. The SPs are used for accessing stack memory processes such as PUSH and POP.

In the Cortex-M3, the instructions for accessing stack memory are PUSH and POP. The assembly language syntax is as follows (text after each semicolon [;] is a comment):

PUSH   {R0}   ; R13=R13-4, then Memory[R13] = R0

POP   {R0}   ; R0 = Memory[R13], then R13 = R13 + 4

The Cortex-M3 uses a full-descending stack arrangement. (More detail on this subject can be found in the "Stack Memory Operations" section of this chapter.) Therefore, the SP decrements when new data is stored in the stack. PUSH and POP are usually used to save register contents to stack memory at the start of a subroutine and then restore the registers from stack at the end of the subroutine. You can PUSH or POP multiple registers in one instruction:

subroutine_1

  PUSH   {R0-R7, R12, R14} ; Save registers

  ...   ; Do your processing

  POP   {R0-R7, R12, R14} ; Restore registers

  BX   R14   ; Return to calling function
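The full-descending behavior described above can be imitated in a few lines of Python. This is only a toy model, not Cortex-M3 code: the class name, the dictionary used as memory, and the example addresses are all assumptions made for illustration. It shows that SP is decremented before each store and incremented after each load, and that a multi-register PUSH leaves the lowest-numbered register at the lowest address.

```python
class FullDescendingStack:
    """Toy model of a full-descending stack (hypothetical helper, not an ARM API).

    'Full' means SP points at the last word stored; 'descending' means
    the stack grows toward lower addresses.
    """

    def __init__(self, initial_sp):
        self.sp = initial_sp      # assumed to be word aligned
        self.memory = {}          # address -> 32-bit word

    def push(self, regs, names):
        # names are listed in ascending register order; storing them in
        # reverse leaves the lowest-numbered register at the lowest address.
        for name in reversed(names):
            self.sp -= 4                                  # SP = SP - 4
            self.memory[self.sp] = regs[name] & 0xFFFFFFFF  # Memory[SP] = Rn

    def pop(self, regs, names):
        for name in names:
            regs[name] = self.memory[self.sp]             # Rn = Memory[SP]
            self.sp += 4                                  # SP = SP + 4


# Usage: save two registers, clobber them, then restore them.
regs = {"R0": 0x11, "R1": 0x22}
stack = FullDescendingStack(0x20001000)   # example address, an assumption
stack.push(regs, ["R0", "R1"])
regs["R0"], regs["R1"] = 0, 0             # "do your processing"
stack.pop(regs, ["R0", "R1"])
```

After the pop, both registers hold their saved values again and SP is back at its starting address.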

Instead of using R13, you can use SP (for SP) in your program codes. It means the same thing. Inside program code, both the MSP and the PSP can be called R13/SP. However, you can access a particular one using special register access instructions (MRS/MSR).

The MSP, also called SP_main in ARM documentation, is the default SP after power-up; it is used by kernel code and exception handlers. The PSP, or SP_process in ARM documentation, is typically used by thread processes in systems with an embedded OS running.

Because register PUSH and POP operations are always word aligned (their addresses must be 0x0, 0x4, 0x8, ...), the SP/R13 bit 0 and bit 1 are hardwired to 0 and always read as zero (RAZ).

3.1.4 Link Register R14

R14 is the link register (LR). Within an assembly program, you can write it as either R14 or LR. LR is used to store the return program counter (PC) when a subroutine or function is called, for instance, when you're using the branch and link (BL) instruction:

main   ; Main program

  ...

  BL function1 ; Call function1 using Branch with Link instruction.

  ; PC = function1 and

  ; LR = the next instruction in main

  ...

function1

  ...   ; Program code for function 1

  BX LR   ; Return

Despite the fact that bit 0 of the PC is always 0 (because instructions are word aligned or half word aligned), the LR bit 0 is readable and writable. This is because in the Thumb instruction set, bit 0 is often used to indicate ARM/Thumb states. To allow the Thumb-2 program for the Cortex-M3 to work with other ARM processors that support the Thumb-2 technology, this least significant bit (LSB) is writable and readable.

3.1.5 Program Counter R15

R15 is the PC. You can access it in assembler code by either R15 or PC. Because of the pipelined nature of the Cortex-M3 processor, when you read this register, you will find that the value is different than the location of the executing instruction, normally by 4. For example:

0x1000 :   MOV   R0, PC   ; R0 = 0x1004

In other instructions like literal load (reading of a memory location related to current PC value), the effective value of PC might not be instruction address plus 4 due to alignment in address calculation. But the PC value is still at least 2 bytes ahead of the instruction address during execution.
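The two cases can be put side by side in a short Python sketch. The function names are made up for illustration, and the word-alignment step for literal loads is an assumption based on how Thumb PC-relative loads are normally described (the PC value is rounded down to a multiple of 4 before the offset is added).

```python
def pc_read_value(instr_addr):
    """Value observed when the instruction at instr_addr reads PC (address + 4)."""
    return instr_addr + 4


def literal_load_base(instr_addr):
    """Base address assumed for a PC-relative literal load.

    The PC value is rounded down to a word boundary, so for an
    instruction at a half-word (but not word) aligned address the
    base is only 2 bytes ahead of the instruction.
    """
    return pc_read_value(instr_addr) & ~0x3


word_aligned = literal_load_base(0x1000)      # 0x1004, four bytes ahead
halfword_aligned = literal_load_base(0x1002)  # 0x1004, two bytes ahead
```

In both cases the base is ahead of the instruction address, by 4 bytes in the first case and by 2 bytes in the second.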

Writing to the PC will cause a branch (but LRs do not get updated). Because an instruction address must be half word aligned, the LSB (bit 0) of the PC read value is always 0. However, in branching, either by writing to PC or using branch instructions, the LSB of the target address should be set to 1 because it is used to indicate the Thumb state operations. If it is 0, it can imply trying to switch to the ARM state and will result in a fault exception in the Cortex-M3.
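A minimal sketch of this rule follows. It is a simplified model, not a description of the processor's internals: the exception class and function name are hypothetical, and the model treats every branch target the same way regardless of which instruction supplied it.

```python
class FaultException(Exception):
    """Stand-in for the fault raised on an attempted switch to ARM state."""


def branch_to(target):
    """Return the new instruction address for a branch to target.

    Bit 0 of the target selects the state: 1 means Thumb. The
    Cortex-M3 only supports Thumb, so a clear bit 0 is modeled as a fault.
    """
    if target & 1 == 0:
        raise FaultException("bit 0 is clear: target implies ARM state")
    return target & ~1   # the PC itself always reads with bit 0 = 0


new_pc = branch_to(0x08000101)   # example address (an assumption), bit 0 set
```

Here `new_pc` is 0x08000100: the Thumb bit is consumed by the branch and does not appear in the PC.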

URL:

https://www.sciencedirect.com/science/article/pii/B9781856179638000065

INTRODUCTION TO THE ARM INSTRUCTION SET

ANDREW N. SLOSS, ... CHRIS WRIGHT, in ARM System Developer's Guide, 2004

3.5 PROGRAM STATUS REGISTER INSTRUCTIONS

The ARM instruction set provides two instructions to directly control a program status register (psr). The MRS instruction transfers the contents of either the cpsr or spsr into a register; in the reverse direction, the MSR instruction transfers the contents of a register into the cpsr or spsr. Together these instructions are used to read and write the cpsr and spsr.

In the syntax you can see a label called fields. This can be any combination of control (c), extension (x), status (s), and flags (f). These fields relate to particular byte regions in a psr, as shown in Figure 3.9.

Figure 3.9. psr byte fields.

MRS copy program status register to a general-purpose register Rd = psr
MSR move a general-purpose register to a program status register psr[field] = Rm
MSR move an immediate value to a program status register psr[field] = immediate

The c field controls the interrupt masks, Thumb state, and processor mode. Example 3.26 shows how to enable IRQ interrupts by clearing the I mask. This operation involves using both the MRS and MSR instructions to read from and then write to the cpsr.

Example 3.26

The MRS first copies the cpsr into register r1. The BIC instruction clears bit 7 of r1. Register r1 is then copied back into the cpsr, which enables IRQ interrupts. You can see from this example that this code preserves all the other settings in the cpsr and only modifies the I bit in the control field.

This example is in SVC mode. In user mode you can read all cpsr bits, but you can only update the condition flag field f.
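The read-modify-write sequence can be modeled on a plain 32-bit integer. The sketch below is not ARM code and the helper names are invented; it assumes the usual byte layout of the four psr fields (c in bits 7..0, x in bits 15..8, s in bits 23..16, f in bits 31..24) and uses an example cpsr value chosen only for illustration.

```python
I_BIT = 1 << 7   # IRQ mask bit, located in the control field

# Byte region covered by each field letter (assumed layout).
FIELD_MASKS = {
    "c": 0x000000FF,   # control
    "x": 0x0000FF00,   # extension
    "s": 0x00FF0000,   # status
    "f": 0xFF000000,   # flags
}


def msr(psr, fields, value):
    """Model of MSR psr_<fields>, value: only the named byte regions change."""
    mask = 0
    for field in fields:
        mask |= FIELD_MASKS[field]
    return (psr & ~mask & 0xFFFFFFFF) | (value & mask)


def enable_irq(cpsr):
    """Model of the MRS / BIC / MSR sequence that clears the I mask."""
    r1 = cpsr                   # MRS: copy cpsr into r1
    r1 &= ~I_BIT                # BIC: clear bit 7
    return msr(cpsr, "c", r1)   # MSR: write back the control field only


# Example value (an assumption): Z and C flags set, I and F masks set, mode bits 0x13.
before = 0x600000D3
after = enable_irq(before)
```

With this input, `after` is 0x60000053: bit 7 is cleared and every other bit, including the flags byte, is unchanged.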

3.5.1 COPROCESSOR INSTRUCTIONS

Coprocessor instructions are used to extend the instruction set. A coprocessor can either provide additional computation capability or be used to control the memory subsystem including caches and memory management. The coprocessor instructions include data processing, register transfer, and memory transfer instructions. We will provide only a short overview since these instructions are coprocessor specific. Note that these instructions are only used by cores with a coprocessor.

CDP coprocessor data processing: perform an operation in a coprocessor
MRC MCR coprocessor register transfer: move data to/from coprocessor registers
LDC STC coprocessor memory transfer: load and store blocks of memory to/from a coprocessor

In the syntax of the coprocessor instructions, the cp field represents the coprocessor number between p0 and p15. The opcode fields describe the operation to take place on the coprocessor. The Cn, Cm, and Cd fields describe registers within the coprocessor. The coprocessor operations and registers depend on the specific coprocessor you are using. Coprocessor 15 (CP15) is reserved for system control purposes, such as memory management, write buffer control, cache control, and identification registers.

EXAMPLE 3.27

This example shows a CP15 register being copied into a general-purpose register.

Here CP15 register-0 contains the processor identification number. This register is copied into the general-purpose register r10.

3.5.2 COPROCESSOR 15 INSTRUCTION SYNTAX

CP15 configures the processor core and has a set of dedicated registers to store configuration information, as shown in Example 3.27. A value written into a register sets a configuration attribute, for example, switching on the cache.

CP15 is called the system control coprocessor. Both MRC and MCR instructions are used to read and write to CP15, where register Rd is the core destination register, Cn is the primary register, Cm is the secondary register, and opcode2 is a secondary register modifier. You may occasionally hear secondary registers called "extended registers."

As an example, here is the instruction to move the contents of CP15 control register c1 into register r1 of the processor core:

We use a shorthand notation for CP15 reference that makes referring to configuration registers easier to follow. The reference notation uses the following format:

The first term, CP15, defines it as coprocessor 15. The second term, after the separating colon, is the primary register. The primary register X can have a value between 0 and 15. The third term is the secondary or extended register. The secondary register Y can have a value between 0 and 15. The last term, opcode2, is an instruction modifier and can have a value between 0 and 7. Some operations may also use a nonzero value w of opcode1. We write these as CP15:w:cX:cY:Z.
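To make the notation concrete, the following Python sketch expands a shorthand reference into the text of an MRC instruction. The helper is hypothetical and is not part of any toolchain. It assumes the four-term form is CP15:cX:cY:Z (the five-term form without w, with opcode1 taken as 0) and the standard operand order `MRC p15, opcode1, Rd, Cn, Cm, opcode2`; the 0 to 7 range check on opcode1 is also an assumption.

```python
def cp15_to_mrc(reference, rd="r0"):
    """Expand 'CP15:cX:cY:Z' or 'CP15:w:cX:cY:Z' into an MRC instruction string."""
    parts = reference.split(":")
    if parts[0] != "CP15" or len(parts) not in (4, 5):
        raise ValueError("expected CP15:cX:cY:Z or CP15:w:cX:cY:Z")

    if len(parts) == 5:
        opcode1 = int(parts[1])          # explicit w term
        cn, cm, opcode2 = parts[2:]
    else:
        opcode1 = 0                      # assumed default when w is omitted
        cn, cm, opcode2 = parts[1:]

    if not (cn.startswith("c") and cm.startswith("c")):
        raise ValueError("register terms must look like cX and cY")
    x, y, z = int(cn[1:]), int(cm[1:]), int(opcode2)

    # Ranges from the text: X and Y are 0..15, opcode2 is 0..7.
    if not (0 <= x <= 15 and 0 <= y <= 15 and 0 <= z <= 7 and 0 <= opcode1 <= 7):
        raise ValueError("field out of range")

    return f"MRC p15, {opcode1}, {rd}, c{x}, c{y}, {z}"


text = cp15_to_mrc("CP15:c1:c0:0", rd="r1")
```

For the reference `CP15:c1:c0:0` with destination r1, the helper produces `MRC p15, 0, r1, c1, c0, 0`.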

URL:

https://www.sciencedirect.com/science/article/pii/B9781558608740500046

Overview of the Cortex-M3

Joseph Yiu, in The Definitive Guide to the ARM Cortex-M3 (Second Edition), 2010

2.2 Registers

The Cortex-M3 processor has registers R0 through R15 (see Figure 2.2). R13 (the stack pointer) is banked, with only one copy of the R13 visible at a time.

Figure 2.2. Registers in the Cortex-M3.

2.2.1 R0–R12: General-Purpose Registers

R0–R12 are 32-bit general-purpose registers for data operations. Some 16-bit Thumb® instructions can only access a subset of these registers (low registers, R0–R7).

2.2.2 R13: Stack Pointers

The Cortex-M3 contains two stack pointers (R13). They are banked so that only one is visible at a time. The two stack pointers are as follows:

Main Stack Pointer (MSP): The default stack pointer, used by the operating system (OS) kernel and exception handlers

Process Stack Pointer (PSP): Used by user application code

The lowest two bits of the stack pointers are always 0, which means they are always word aligned.

2.2.3 R14: The Link Register

When a subroutine is called, the return address is stored in the link register.

2.2.4 R15: The Program Counter

The program counter is the current program address. This register can be written to control the program flow.

2.2.5 Special Registers

The Cortex-M3 processor also has a number of special registers (see Figure 2.3). They are as follows:

Program Status registers (PSRs)

Interrupt Mask registers (PRIMASK, FAULTMASK, and BASEPRI)

Control register (CONTROL)

Figure 2.3. Special Registers in the Cortex-M3.

These registers have special functions and can be accessed only by special instructions. They cannot be used for normal data processing (see Table 2.1).

Table 2.1. Special Registers and Their Functions

Register Function
xPSR Provide arithmetic and logic processing flags (zero flag and carry flag), execution status, and current executing interrupt number
PRIMASK Disable all interrupts except the nonmaskable interrupt (NMI) and hard fault
FAULTMASK Disable all interrupts except the NMI
BASEPRI Disable all interrupts of specific priority level or lower priority level
CONTROL Define privileged status and stack pointer selection

For more information on these registers, see Chapter 3.
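The effect of the three interrupt mask registers in Table 2.1 can be summarized as a small decision function. This is a sketch under stated assumptions, not a model of the full exception logic: it uses the Cortex-M convention that a lower number means a higher priority, fixes the NMI at -2 and the hard fault at -1, treats a BASEPRI value of 0 as "masking disabled", and ignores preemption by whatever handler is currently running. The function name is made up.

```python
NMI_PRIORITY = -2          # fixed priority assumed for the NMI
HARD_FAULT_PRIORITY = -1   # fixed priority assumed for the hard fault


def is_masked(priority, primask=0, faultmask=0, basepri=0):
    """Return True if an exception of the given priority is blocked."""
    if faultmask:
        # Everything except the NMI is blocked.
        return priority >= HARD_FAULT_PRIORITY
    if primask:
        # Everything except the NMI and the hard fault is blocked.
        return priority >= 0
    if basepri:
        # Blocks the given level and all lower-priority (numerically larger) levels.
        return priority >= basepri
    return False


# Usage: with BASEPRI = 0x60, priority 0x60 is blocked but 0x40 still gets through.
blocked = is_masked(0x60, basepri=0x60)
allowed = not is_masked(0x40, basepri=0x60)
```

The ordering of the checks reflects that FAULTMASK is the strongest mask and BASEPRI the weakest.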

URL:

https://www.sciencedirect.com/science/article/pii/B9781856179638000053

Early Intel® Architecture

In Power and Performance, 2015

1.1.2 Registers

Aside from the four segment registers introduced in the previous section, the 8086 has eight general purpose registers, and two status registers.

The general purpose registers are divided into two categories. Four registers, AX, BX, CX, and DX, are classified as data registers. These data registers are accessible as either the full 16-bit register, represented with the X suffix, the low byte of the full 16-bit register, designated with an L suffix, or the high byte of the 16-bit register, delineated with an H suffix. For instance, AX would access the full 16-bit register, whereas AL and AH would access the register's low and high bytes, respectively.
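The relationship between the three views of a data register can be sketched with a small Python class. The class and attribute names are invented for illustration; the point is only that the low and high byte views alias the same 16 bits as the full register.

```python
class DataRegister:
    """Toy model of an 8086 data register such as AX with its AL/AH views."""

    def __init__(self, value=0):
        self.full = value & 0xFFFF            # the X view, e.g. AX

    @property
    def low(self):                            # the L view, e.g. AL
        return self.full & 0x00FF

    @low.setter
    def low(self, value):
        self.full = (self.full & 0xFF00) | (value & 0xFF)

    @property
    def high(self):                           # the H view, e.g. AH
        return self.full >> 8

    @high.setter
    def high(self, value):
        self.full = (self.full & 0x00FF) | ((value & 0xFF) << 8)


ax = DataRegister(0x1234)
ax.high = 0xAB      # writing AH changes only the upper byte of AX
```

After the write, `ax.full` is 0xAB34 while `ax.low` is still 0x34.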

The second classification of registers are the pointer/index registers. This includes the following four registers: SP, BP, SI, and DI. The SP register, the stack pointer, is reserved for usage as a pointer to the top of the stack. The SI and DI registers are typically used implicitly as the source and destination pointers, respectively. Unlike the data registers, the pointer/index registers are only accessible as full 16-bit registers.

As this categorization may indicate, the general purpose registers come with some guidance for their intended usage. This guidance is reflected in the instruction forms with implicit operands. Instructions with implicit operands, that is, operands which are assumed to be a certain register and therefore don't require that operand to be encoded, allow for shorter encodings for common usages. For convenience, instructions with implicit forms typically also have explicit forms, which require more bytes to encode. The recommended uses for the registers are as follows:

AX Accumulator

BX Data (relative to DS)

CX Loop counter

DX Data

SI Source pointer (relative to DS)

DI Destination pointer (relative to ES)

SP Stack pointer (relative to SS)

BP Base pointer of stack frame (relative to SS)

Aside from allowing for shorter instruction encodings, this guidance is also an aid to the programmer who, once familiar with the various register meanings, will be able to deduce the meaning of assembly, assuming it conforms to the guidelines, much faster. This parallels, to some degree, how variable names help the programmer reason about their contents. It's important to note that these are just suggestions, not rules.

Additionally, there are two status registers, the instruction pointer and the flags register.

The instruction pointer, IP, is also often referred to as the program counter. This register contains the memory address of the next instruction to be executed. Until 64-bit mode was introduced, the instruction pointer was not directly accessible to the programmer, that is, it wasn't possible to access it like the other general purpose registers. Despite this, the instruction pointer was indirectly accessible. Whereas the instruction pointer couldn't be modified through a MOV instruction, it could be modified by any instruction that alters the program flow, such as the CALL or JMP instructions.

Reading the contents of the instruction pointer was also possible by taking advantage of how x86 handles function calls. Transfer from one function to another occurs through the CALL and RET instructions. The CALL instruction preserves the current value of the instruction pointer, pushing it onto the stack in order to support nested function calls, and then loads the instruction pointer with the new address, provided as an operand to the instruction. This value on the stack is referred to as the return address. Whenever the function has finished executing, the RET instruction pops the return address off of the stack and restores it into the instruction pointer, thus transferring control back to the function that initiated the function call. Leveraging this, the programmer can create a special thunk function that would simply copy the return address off of the stack, load it into one of the registers, and then return. For example, when compiling Position-Independent-Code (PIC), which is discussed in Chapter 12, the compiler will automatically add functions that use this technique to obtain the instruction pointer. These functions are commonly called __x86.get_pc_thunk.bx(), __x86.get_pc_thunk.cx(), __x86.get_pc_thunk.dx(), and so on, depending on which register the instruction pointer is loaded into.
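The thunk technique can be traced with a toy machine model. Nothing below is x86 code: the class, its method names, the addresses, and the 5-byte call length are all assumptions chosen to make the sequence of stack operations visible.

```python
class TinyCpu:
    """Toy model with an instruction pointer, a stack, and one register file."""

    def __init__(self, ip=0):
        self.ip = ip
        self.stack = []
        self.regs = {"bx": 0, "cx": 0, "dx": 0}

    def call(self, target, instr_len):
        # CALL pushes the address of the next instruction, then jumps.
        self.stack.append(self.ip + instr_len)
        self.ip = target

    def ret(self):
        # RET pops the return address back into the instruction pointer.
        self.ip = self.stack.pop()

    def get_pc_thunk(self, reg):
        # Body of the thunk: copy the return address from the top of the
        # stack into a register, then return to the caller.
        self.regs[reg] = self.stack[-1]
        self.ret()


cpu = TinyCpu(ip=0x1000)
cpu.call(0x2000, instr_len=5)   # call the thunk located at 0x2000
cpu.get_pc_thunk("bx")
```

After the thunk returns, `bx` holds 0x1005, the address of the instruction following the call, which is also where execution resumes.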

The second status register, the EFLAGS register, is comprised of 1-bit status and control flags. These bits are set by various instructions, typically arithmetic or logic instructions, to signal certain conditions. These condition flags can then be checked in order to make decisions. For a list of the flags modified by each instruction, see the Intel SDM. The 8086 defined the following status and control bits in EFLAGS:

Zero Flag (ZF) Set if the result of the instruction is zero.

Sign Flag (SF) Set if the result of the instruction is negative.

Overflow Flag (OF) Set if the result of the instruction overflowed.

Parity Flag (PF) Set if the result has an even number of bits set.

Carry Flag (CF) Used for storing the carry bit in instructions that perform arithmetic with carry (for implementing extended precision).

Adjust Flag (AF) Similar to the Carry Flag. In the parlance of the 8086 documentation, this was referred to as the Auxiliary Carry Flag.

Direction Flag (DF) For instructions that either autoincrement or autodecrement a pointer, this flag chooses which to perform. If set, autodecrement, otherwise autoincrement.

Interrupt Enable Flag (IF) Determines whether maskable interrupts are enabled.

Trap Flag (TF) If set, the CPU operates in single-step debugging mode.
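As an illustration of the arithmetic status flags, the sketch below computes them for a 16-bit addition. The function name is invented. Two details are assumptions beyond the list above: parity is evaluated on the low byte of the result only, and the adjust flag is taken as the carry out of bit 3, which is how x86 is normally documented to behave.

```python
def add16_flags(a, b):
    """Add two 16-bit values and return (result, flags) for a toy 8086 model."""
    a &= 0xFFFF
    b &= 0xFFFF
    total = a + b
    result = total & 0xFFFF
    flags = {
        "ZF": result == 0,
        "SF": bool(result & 0x8000),
        # Unsigned carry out of bit 15.
        "CF": total > 0xFFFF,
        # Signed overflow: operands share a sign that differs from the result's.
        "OF": bool(~(a ^ b) & (a ^ result) & 0x8000),
        # Even number of set bits in the low byte (assumption: low byte only).
        "PF": bin(result & 0xFF).count("1") % 2 == 0,
        # Carry out of the low nibble.
        "AF": (a & 0xF) + (b & 0xF) > 0xF,
    }
    return result, flags


result, flags = add16_flags(0x7FFF, 0x0001)
```

Adding 1 to 0x7FFF gives 0x8000: the result is negative when read as signed, so SF and OF are set while CF stays clear.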

URL:

https://www.sciencedirect.com/science/article/pii/B978012800726600001X

Intel® Pentium® Processors

In Power and Performance, 2015

2.2.3 Out-of-Order Execution

As discussed in Section 2.1.1, prior to the 80486, the processor handled one instruction at a time. As a result, the processor's resources remained idle while the currently executing instruction was not utilizing them. With the introduction of pipelining, the pipeline was partitioned to allow multiple instructions to coexist simultaneously. Therefore, when the currently executing instruction had finished with some of the processor's resources, the next instruction could begin utilizing them before the first instruction had completely finished executing. The introduction of μops expanded significantly on this concept, splitting instruction execution into smaller steps.

Each type of μop has a corresponding type of execution unit. The Pentium Pro has five execution units: two for handling integer μops, two for handling floating point μops, and one for handling memory μops. Therefore, up to five μops can execute in parallel. An instruction, divided into one or more μops, is not done executing until all of its corresponding μops have finished. Obviously, μops from the same instruction have dependencies upon one another so they can't all execute simultaneously. Therefore, μops from multiple instructions are dispatched to the execution units.

Taking advantage of the fine granularity of μops, out-of-order execution significantly improves utilization of the execution units. Up until the Pentium Pro, Intel processors executed in-order, meaning that instructions were executed in the same sequence as they were organized in memory. With out-of-order execution, μops are scheduled based on the available resources, as opposed to their ordering. As instructions are fetched and decoded, the resulting μops are stored in the Reorder Buffer. As execution units and other resources become available, the Reservation Station dispatches the corresponding μop to one of the execution units. Once the μop has finished executing, the result is stored back into the Reorder Buffer. Once all of the μops associated with an instruction have completed execution, the μops retire, that is, they are removed from the Reorder Buffer and any results or side-effects are made visible to the rest of the system. While instructions can execute in any order, instructions always retire in-order, ensuring that the programmer does not need to worry about handling out-of-order execution.

To illustrate the problem with in-order execution and the benefit of out-of-order execution, consider the following hypothetical situation. Assume that a processor has two execution units capable of handling integer μops and one capable of handling floating point μops. With in-order scheduling, the most efficient usage of this processor would be to intermix integer and floating point instructions following the two-to-one ratio. This would involve carefully scheduling instructions based on their instruction latencies, along with the latencies for fetching any memory resources, to ensure that when an execution unit becomes available, the next μop in the queue would be executable with that unit.

For example, consider four instructions scheduled on this example processor, three integer instructions followed by a floating point instruction. Assume that each instruction corresponds to one μop, that these instructions have no interdependencies, and that all three execution units are currently available. The first two integer instructions would be dispatched to the two available integer execution units, but the floating point instruction would not be dispatched, even though the floating point execution unit was available. This is because the third integer instruction, waiting for one of the two integer execution units to become available, must be issued first. This underutilizes the processor's resources. With out-of-order execution, the first two integer instructions and the floating point instruction would be dispatched together.
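The four-instruction example can be reproduced with a very small dispatch model. This is a sketch of a single dispatch step on the hypothetical processor from the text, not of any real scheduler: the function name is invented, each instruction is one μop with no dependencies, and latencies are ignored.

```python
def dispatch_cycle(queue, free_units, in_order):
    """Return the queue indices of the uops dispatched in one cycle.

    queue      -- uop kinds in program order, e.g. ["int", "int", "int", "fp"]
    free_units -- number of free execution units per kind
    in_order   -- if True, a stalled uop blocks every younger uop
    """
    free = dict(free_units)
    issued = []
    for index, kind in enumerate(queue):
        if free.get(kind, 0) > 0:
            free[kind] -= 1
            issued.append(index)
        elif in_order:
            break   # an older uop is waiting, so nothing younger may issue
    return issued


queue = ["int", "int", "int", "fp"]
units = {"int": 2, "fp": 1}
in_order_issue = dispatch_cycle(queue, units, in_order=True)
out_of_order_issue = dispatch_cycle(queue, units, in_order=False)
```

In-order dispatch issues only the first two integer μops (indices 0 and 1), while out-of-order dispatch also issues the floating point μop (index 3), leaving just the third integer μop waiting.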

In other words, out-of-order execution improves the utilization of the processor's resources. Additionally, because μops are scheduled based on available resources, some instruction latencies, such as an expensive load from memory, may be partially or completely masked if other work can be scheduled instead.

Register Renaming

From the instruction set perspective, Intel processors have eight general purpose registers in 32-bit mode, and sixteen general purpose registers in 64-bit mode; however, from the internal hardware perspective, Intel processors have many more registers. For example, the Pentium Pro has forty registers, organized in a structure referred to as a Physical Register File.

While this many extra registers might seem like a performance boon, especially if the reader is familiar with the performance gain received from the eight extra registers in 64-bit mode, these registers serve a different purpose. Rather than providing the process with more registers, these extra registers serve to handle data dependencies in the out-of-order execution engine.

When a value is stored into a register, a new register file entry is assigned to contain that value. Once another value is stored into that register, a different register file entry is assigned to contain this new value. Internal to the processor core, each data dependency on the first value will reference the first entry, and each data dependency on the second value will reference the second entry. Therefore, the out-of-order engine is able to execute instructions in an order that would otherwise be impossible due to false data dependencies.
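The mapping described above can be sketched as a rename table. The class below is a simplified illustration with invented names; it hands out entries from a free list and never recycles them, which a real processor would do once the older value is no longer needed.

```python
class RenameTable:
    """Toy register renamer: architectural names map to physical entries."""

    def __init__(self, num_physical):
        self.free = list(range(num_physical))   # unassigned physical entries
        self.mapping = {}                       # architectural name -> entry

    def write(self, arch_reg):
        """Assign a fresh entry for a new value written to arch_reg."""
        entry = self.free.pop(0)
        self.mapping[arch_reg] = entry
        return entry

    def read(self, arch_reg):
        """Return the entry holding the most recent value of arch_reg."""
        return self.mapping[arch_reg]


table = RenameTable(40)            # forty entries, as in the Pentium Pro example
first = table.write("eax")         # first value stored into eax
use_of_first = table.read("eax")   # a consumer of the first value
second = table.write("eax")        # second value gets a different entry
use_of_second = table.read("eax")  # a consumer of the second value
```

Because the two writes to `eax` land in different entries, the consumer of the first value and the producer of the second no longer conflict, which is what removes the false dependency.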

URL:

https://www.sciencedirect.com/science/article/pii/B9780128007266000021

Load/store and branch instructions

Larry D. Pyeatt, William Ughetta, in ARM 64-Bit Assembly Language, 2020

3.2 AArch64 user registers

As shown in Fig. 3.2, the AArch64 ISA provides 31 general-purpose registers, which are called [Image 2] through [Image 3]. These registers can each store 64 bits of data. To use all 64 bits, they are referred to as x0 through x30 (capitalization is optional). To use only the lower (least significant) 32 bits, they are referred to as w0 through w30. Since each register has a 64-bit name and a 32-bit name, we use [Image 7] through [Image 8] to specify a register without specifying the number of bits. For example, when we refer to [Image 9], we are really referring to either [Image 10] or [Image 11].

Figure 3.2. AArch64 general purpose registers ([Image 1]) and special registers.

3.2.1 General purpose registers

The general-purpose registers are each used according to specific conventions. These rules are defined in the application binary interface (ABI). The AArch64 ABI is called AAPCS64. The difference between callee saved and caller saved registers will also be explained in Section 5.4.4.

Registers [Image 12] are used for passing arguments when calling a procedure or function. Registers [Image 13] are scratch registers and can be used at any time because no assumptions are made about what they contain. They are called scratch registers because they are useful for holding temporary results of calculations. Registers [Image 14] can also be used as scratch registers, but their contents must be saved before they are used, and restored to their original contents before the procedure exits.

Some of the registers have alternate names. For example, [Image 15] is also known as [Image 16]. Most of these alternate names are only of interest to people writing compilers and operating systems. However, two of these registers are of interest to all AArch64 programmers.

3.2.2 Frame pointer

The frame pointer, x29, is used by high-level language compilers to track the current stack frame. This register can be helpful when the program is running under a debugger, and can sometimes help the compiler to generate more efficient code for returning from a subroutine. The GNU C compiler can be instructed to use x29 as a general-purpose register by using the -fomit-frame-pointer command line option. The use of x29 as the frame pointer is a programming convention. Some instructions (e.g. branches) implicitly modify the program counter, the link register, and even the stack pointer, so they are considered to be hardware special registers. As far as the hardware is concerned, the frame pointer is exactly the same as the other general-purpose registers, but AArch64 programmers use it for the frame pointer because of the ABI.

3.2.3 PSTATE register

The PSTATE register contains bits that indicate the status of the current process, including data about the results of previous operations. Fig. 3.3 shows all of its bits. The dashed lines indicate unused space that may be reserved for future AArch64 architectural extensions. The PSTATE register is actually a collection of independent fields, most of which are only used by the operating system. User programs make use of the first four bits, N, Z, C, and V. These are referred to as the condition flags field. Most instructions can modify these flags, and later instructions can use the flags to control their operation. Their meaning is as follows:

Negative:

This bit is set to one if the signed result of an operation is negative, and set to zero if the result is positive or zero.

Zero:

This bit is set to one if the result of an operation is zero, and set to zero if the result is non-zero.

Carry:

This bit is set to one if an add operation results in a carry out of the most significant bit, or if a subtract operation does not result in a borrow. For shift operations, this flag is set to the last bit shifted out by the shifter.

oVerflow:

For addition and subtraction, this flag is set if a signed overflow occurred.
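The four condition flags can be computed for 64-bit addition and subtraction with a short Python sketch. The function names borrow the mnemonics of the flag-setting add and subtract instructions but are only illustrative models. Subtraction is modeled as a + NOT(b) + 1, so in this sketch C ends up as 1 when no borrow is needed, which matches how the carry flag is documented for ARM subtraction.

```python
MASK64 = (1 << 64) - 1


def adds(a, b):
    """Model a 64-bit flag-setting add; returns (result, (N, Z, C, V))."""
    a &= MASK64
    b &= MASK64
    total = a + b
    result = total & MASK64
    n = result >> 63
    z = int(result == 0)
    c = int(total > MASK64)                       # carry out of bit 63
    v = ((~(a ^ b) & (a ^ result)) >> 63) & 1     # same-sign inputs, different-sign result
    return result, (n, z, c, v)


def subs(a, b):
    """Model a 64-bit flag-setting subtract as a + NOT(b) + 1."""
    a &= MASK64
    b &= MASK64
    total = a + (~b & MASK64) + 1
    result = total & MASK64
    n = result >> 63
    z = int(result == 0)
    c = int(total > MASK64)                       # 1 means no borrow occurred
    v = (((a ^ b) & (a ^ result)) >> 63) & 1      # different-sign inputs, result sign differs from a
    return result, (n, z, c, v)


equal = subs(5, 5)     # (0, (0, 1, 1, 0)): zero result, no borrow so C = 1
smaller = subs(3, 5)   # negative result, borrow needed so C = 0
```

Comparing two equal values therefore sets both Z and C, while subtracting a larger value from a smaller one sets N and clears C.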

Figure 3.3. Fields in the PSTATE register.

3.2.4 Link register

The procedure link register, x30, is used to hold the return address for subroutines. Certain instructions cause the program counter to be copied to the link register, then the program counter is loaded with a new address. These branch-and-link instructions are briefly covered in Section 3.5 and in more detail in Section 5.4. The link register could theoretically be used as a scratch register, but its contents are modified by hardware when a subroutine is called, in order to save the correct return address. Using x30 as a general-purpose register is dangerous and is strongly discouraged.

3.2.5 Stack pointer

The program stack was introduced in Section 1.4. The stack pointer, sp, is used to hold the address where the stack ends. This is commonly referred to as the top of the stack, although on most systems the stack grows downwards and the stack pointer actually refers to the lowest address in the stack. The address where the stack ends may change when registers are pushed onto the stack, or when temporary local variables (automatic variables) are allocated or deleted. The use of the stack for storing automatic variables is described in Chapter 5. The stack pointer can only be modified or read by a small set of instructions.

3.2.half dozen Zero annals

The zero register, [Image 20], can be referred to as a 64-bit register, [Image 21], or a 32-bit register, [Image 22]. It always has the value zero. Most instructions can use the zero register as an operand, even as a destination register. If this is the case, the instruction will not change the destination register. However, it can still have side effects, including updating the [Image 18] flags based on the ALU operation and incrementing a register in pre-indexed or post-indexed addressing. The zero register cannot always be used as an operand. It shares the same binary encoding with the stack pointer register, [Image 19], which is the value [Image 23]. Some instructions can access the zero register, while others can access the stack pointer.
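As a rough illustration of the behaviour described above, the Python sketch below models a register file with a hard-wired zero register. The names `ToyRegisterFile`, `zr`, `r0`, `r1` and the single `zero_flag` are hypothetical stand-ins for the real register names and flags; this is a toy under those assumptions, not a model of the actual instruction set.

```python
class ToyRegisterFile:
    """Toy register file with a hard-wired zero register (illustrative only)."""

    ZR = "zr"  # hypothetical name for the zero register

    def __init__(self):
        self.regs = {"r0": 0, "r1": 0}
        self.zero_flag = False  # stands in for one condition flag

    def read(self, name):
        # Reading the zero register always yields 0.
        return 0 if name == self.ZR else self.regs[name]

    def write(self, name, value):
        # Writes to the zero register are silently discarded.
        if name != self.ZR:
            self.regs[name] = value

    def subs(self, dest, a, b):
        """Subtract and set the flag; the flag updates even if dest is ZR."""
        result = self.read(a) - self.read(b)
        self.zero_flag = (result == 0)
        self.write(dest, result)


rf = ToyRegisterFile()
rf.write("r0", 7)
rf.write("r1", 7)
rf.subs(ToyRegisterFile.ZR, "r0", "r1")   # acts as a comparison: result dropped
```

Using the zero register as the destination turns the subtraction into a pure comparison: the flag records that the operands were equal, while no general register changes.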

3.2.7 Program counter

The program counter, [Image 24], always contains the address of the next instruction that will be executed. The processor increments this register by four, automatically, after each instruction is fetched from memory. By moving an address into this register, the programmer can cause the processor to fetch the next instruction from the new address. This gives the programmer the ability to jump to any address and begin executing code there. Only a small number of instructions can access the [Image 24] directly. For example, instructions that create a PC-relative address, such as [Image 25], and instructions which load a register, such as [Image 26], are able to access the program counter directly.
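The fetch behaviour described above can be sketched as a small loop. This Python example is a toy under stated assumptions: the function `run`, the tuple instruction format, and the `nop`, `jump` and `halt` operations are all hypothetical and exist only to show the program counter advancing by four and being redirected by a jump.

```python
def run(program, start=0, max_steps=20):
    """Toy fetch-execute loop. `program` maps addresses to instructions.

    Each instruction is a tuple: ("nop",), ("jump", address) or ("halt",).
    Returns the list of addresses that were fetched, in order.
    """
    pc = start
    trace = []
    for _ in range(max_steps):
        instruction = program[pc]
        trace.append(pc)
        pc += 4                      # incremented by four after each fetch
        if instruction[0] == "jump":
            pc = instruction[1]      # writing an address to the PC redirects the fetch
        elif instruction[0] == "halt":
            break
    return trace


program = {
    0x00: ("nop",),
    0x04: ("jump", 0x10),
    0x08: ("nop",),      # never fetched
    0x0C: ("nop",),      # never fetched
    0x10: ("halt",),
}
trace = run(program)
```

The trace contains addresses 0x00, 0x04 and 0x10 only; the two instructions after the jump are skipped because the program counter was overwritten.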

URL: https://www.sciencedirect.com/science/article/pii/B9780128192214000109

Knights Landing architecture

Jim Jeffers, ... Avinash Sodani, in Intel Xeon Phi Processor High Performance Programming (Second Edition), 2016

Integer execution unit

The IEU executes integer μops, which are defined as those that operate on general-purpose registers R0–R15 (i.e., RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI, R8…R15). There are two IEUs in the core. Each IEU contains a 12-entry RS that issues one μop per cycle. The integer RSes are fully out-of-order in their scheduling. Most operations have 1-cycle latency and are supported by both IEUs, but a few operations have 3- or 5-cycle latency (e.g., multiplies) and are only supported by one of the IEUs.

URL: https://www.sciencedirect.com/science/article/pii/B9780128091944000041

Computer Data Processing Hardware Architecture

Paul J. Fortier, Howard E. Michel, in Computer Systems Performance Evaluation and Prediction, 2003

2.3.1 Instruction types

Based on the number of registers available and the configuration of these registers, several types of instruction are possible. For example, if many registers are available, as would be the case in a stack computer, no address computations are needed and the instruction can therefore be much shorter, both in format and in execution time required. On the other hand, if there are no general registers and all computations are performed by memory movements of data, then instructions will be longer and require more time due to operand fetching and storage. The following are representative of instruction types:

0-address instructions: This type of instruction is found in machines where many general-purpose registers are available. This is the case in stack machines and in some reduced instruction set machines. Instructions of this type perform their function entirely using registers. If we have three general registers, A, B, and C, a typical format would have the form:

(2.1) R[A] ← R[B] operator R[C]

which indicates that the contents of registers B and C have the operator (such as add, subtract, multiply, etc.) performed on them, with the result stored in general register A. Similarly, we could describe instructions that use only one or two registers as follows:

(2.2) R[B] ← R[B] operator R[C]

or

(2.3) operator R[C]

which represent two-register and one-register instructions, respectively. In the two-register case, one of the operand registers is also used as the result register. In the single-register case, the operand register is also the result register. The increment instruction is an example of a one-register instruction. This type of instruction is found in all machines.

1-address instructions: In this type of instruction, a single memory address is found in the instruction. If another operand is used, it is typically an accumulator or the top of a stack in a stack computer. The typical format of these instructions has the form:

(2.4) operator M[address]

where the contents of the named memory address have the named operator performed on them in conjunction with an implied special register. An example of such an instruction could be as follows:

(2.5) Move M[100]

or

(2.6) Add M[100]

which moves the contents of memory location 100 into the ALU's accumulator or adds the contents of memory address 100 to the accumulator and stores the result in the accumulator. If the result must be stored in memory, we would need a store instruction:

(2.7) Store M[100]

1-and-1/2-address instructions: Once we have an architecture that has some general-purpose registers, we can provide more advanced operations combining memory contents and the general registers. The typical instruction performs an operation on a memory location's contents with that of a general register. For instance, we could add the contents of a memory location to the contents of a general register, A, as shown:

(2.8) Add R[A], M[100]

This instruction typically stores the result in the first named location or register in the instruction. In this example it is register A.

2-address instructions: Two-address instructions use two memory locations to perform an instruction, for instance, a block move of N words from one location in memory to another, or a block add. The move may appear as follows:

(2.9) Move N, M[100], M[1000]

2-and-1/2-address instructions: This format uses two memory locations and a general register in the instruction. Typical of this type of instruction is an operation involving two memory locations that stores the result in a register, or an operation with a general register and a memory location that stores the result in another memory location, as shown:

(2.10) R[A] ← M[100] operator M[1000]
       M[1000] ← M[100] operator R[A]

3-address instructions: Another less common class of instruction format is the 3-address instruction. These instructions involve three memory locations: two used for operands and one as the result location. A typical format is shown:

(2.11) M[200] ← M[100] operator M[300]
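The instruction formats listed above can be compared side by side on a toy machine. The Python sketch below is illustrative only: the class `ToyMachine`, its method names, the operator table `OPS`, and the register and memory contents are all hypothetical, chosen so that each method mirrors one of the numbered formats.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


class ToyMachine:
    """Toy machine with registers R, memory M and an accumulator (illustrative only)."""

    def __init__(self):
        self.R = {"A": 0, "B": 0, "C": 0}
        self.M = {100: 0, 200: 0, 300: 0, 1000: 0}
        self.acc = 0

    # 0-address (register-only) form: R[dst] <- R[src1] op R[src2]
    def reg_op(self, op, dst, src1, src2):
        self.R[dst] = OPS[op](self.R[src1], self.R[src2])

    # 1-address forms: one memory address plus the implied accumulator
    def move(self, addr):
        self.acc = self.M[addr]

    def acc_op(self, op, addr):
        self.acc = OPS[op](self.acc, self.M[addr])

    def store(self, addr):
        self.M[addr] = self.acc

    # 1-and-1/2-address form: R[reg] <- R[reg] op M[addr]
    def reg_mem_op(self, op, reg, addr):
        self.R[reg] = OPS[op](self.R[reg], self.M[addr])

    # 3-address form: M[dst] <- M[src1] op M[src2]
    def mem_op(self, op, dst, src1, src2):
        self.M[dst] = OPS[op](self.M[src1], self.M[src2])


m = ToyMachine()
m.R.update({"B": 6, "C": 7})
m.M.update({100: 5, 300: 4})

m.reg_op("mul", "A", "B", "C")      # like (2.1): R[A] = R[B] * R[C]
m.move(100)                         # like (2.5): acc = M[100]
m.acc_op("add", 100)                # like (2.6): acc = acc + M[100]
m.store(1000)                       # (2.7)-style store, here to M[1000]
m.reg_mem_op("add", "B", 100)       # like (2.8): R[B] = R[B] + M[100]
m.mem_op("sub", 200, 100, 300)      # like (2.11): M[200] = M[100] - M[300]
```

The register-only form names no memory address at all, while the 3-address form must carry three, which is the size and fetch-time trade-off the text describes.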

URL: https://www.sciencedirect.com/science/article/pii/B9781555582609500023

Advanced Encryption Standard

Tom St Denis, Simon Johnson, in Cryptography for Developers, 2007

x86 Performance

The AMD Opteron achieves a nice boost due to the addition of the eight new general-purpose registers. If we examine the GCC output for x86_64 and x86_32 platforms, we can see a nice difference between the two (Table 4.2).

Table 4.2. First Quarter of an AES Round

Both snippets accomplish (at least) the first MixColumns step of the first round in the loop. Note that the compiler has scheduled part of the second MixColumns during the first to achieve higher parallelism. Even though in Table 4.2 the x86_64 code looks longer, it executes faster, partially because it processes more of the second MixColumns in roughly the same time and makes good use of the extra registers.

From the x86_32 side, we can clearly see various spills to the stack (in bold). Each of those costs us three cycles (at a minimum) on the AMD processors (two cycles on most Intel processors). The 64-bit code was compiled to have zero stack spills during the main loop of rounds. The 32-bit code has about 15 stack spills during each round, which incurs a penalty of at least 45 cycles per round or 405 cycles over the course of the nine full rounds.

Of course, we do not see the full penalty of 405 cycles, as more than one opcode is being executed at the same time. The penalty is also masked by parallel loads that are also on the critical path (such as loads from the Te tables or round key). Those delays occur anyway, so the fact that we are also loading from (or storing to) the stack at the same time does not add to the cycle count.

In either case, we can improve upon the code that GCC (4.1.1 in this case) emits. In the 64-bit code, we see a pairing of "shrq $24, %rdx" and "andl $255,%edx". The andl operation is not required since only the lower 32 bits of %rdx are guaranteed to have anything in them. This potentially saves up to 36 cycles over the course of nine rounds (depending on how the andl operation pairs up with other opcodes).

With the 32-bit code, the double loads from (%esp) (lines 2 and 3) incur a needless three-cycle penalty. In the case of the AMD Athlon (and Opterons), the load-store unit will short the load operation (in certain circumstances), but the load will always take at least three cycles. Changing the second load to "movl %edx,%ebx" means that we stall waiting for %edx, but the penalty is only one cycle, not three. That change alone will free up at most 9*2*4 = 72 cycles from the nine rounds.
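The cycle estimates quoted in this section are simple products, and the short Python sketch below reproduces them as upper bounds. The variable names are hypothetical, and reading the factor of 4 in 9*2*4 as the number of replaced loads per round is an assumption; the text gives only the product.

```python
ROUNDS = 9                      # full rounds counted in the text

# Upper bound on the 32-bit spill penalty (ignores overlap with other work).
spills_per_round = 15           # "about 15" in the text
cycles_per_spill = 3            # minimum on the AMD processors discussed
spill_penalty_per_round = spills_per_round * cycles_per_spill
spill_penalty_total = spill_penalty_per_round * ROUNDS

# Upper bound on the saving from replacing the second load with a register move.
cycles_saved_per_load = 3 - 1   # three-cycle load versus one-cycle stall
loads_replaced_per_round = 4    # assumption: the factor of 4 in 9*2*4
load_saving_total = ROUNDS * cycles_saved_per_load * loads_replaced_per_round
```

These are worst-case figures; as the text notes, overlapping execution hides part of the penalty, so the measured difference would be smaller.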

URL: https://www.sciencedirect.com/science/article/pii/B9781597491044500078

Embedded Processor Architecture

Peter Barry, Patrick Crowley, in Modern Embedded Computing, 2012

Register Operands

Source and destination operands can be any of the following registers, depending on the instruction being executed:

32-bit general-purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, or EBP)

16-bit general-purpose registers (AX, BX, CX, DX, SI, DI, SP, or BP)

8-bit general-purpose registers (AH, BH, CH, DH, AL, BL, CL, DL)

Segment registers

EFLAGS register

MMX

Control (CR0 through CR4)

System Table registers (such as the Interrupt Descriptor Table register)

Debug registers

Machine-specific registers

On RISC embedded processors, there are generally fewer limitations on the registers that can be used by instructions. IA-32 often restricts the registers that can be used as operands for certain instructions.
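The 32-, 16- and 8-bit register names in the list above are overlapping views of the same storage: on IA-32, AX is the low 16 bits of EAX, and AH and AL are the high and low bytes of AX. The Python sketch below shows that relationship with plain bit masks; the function name `sub_registers` and the sample value are hypothetical, chosen for illustration.

```python
def sub_registers(eax):
    """Return the AX, AH and AL views of a 32-bit EAX value."""
    eax &= 0xFFFFFFFF          # keep 32 bits
    ax = eax & 0xFFFF          # low 16 bits
    ah = (ax >> 8) & 0xFF      # bits 15..8
    al = ax & 0xFF             # bits 7..0
    return {"AX": ax, "AH": ah, "AL": al}


views = sub_registers(0x12345678)
```

For the sample value, AX is 0x5678, AH is 0x56 and AL is 0x78, so writing any of the narrower names changes part of EAX.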

URL: https://www.sciencedirect.com/science/article/pii/B9780123914903000059