Fixing timing issues in Static Timing Analysis

HOW TO FIX TIMING PROBLEMS

Setup time and hold time are timing requirements that must be met for a design to operate correctly. If the setup time is not met, the flip-flop may latch incorrect data; this is called a setup time violation. Similarly, if the hold time is not met, the output can be wrong; this is called a hold time violation.

In gate-level simulation, the cell library's flip-flop models check the input setup time and input hold time at every clock edge of every flip-flop. If either requirement is violated, the output of that flip-flop becomes X. An X at the input of a downstream flip-flop produces an X at its output, so the design quickly fills with X's and debugging becomes difficult.

Whenever a flip-flop's setup or hold time is violated, the flip-flop can enter a state in which its output is unpredictable; this is known as the metastable state. When the metastable state ends, the flip-flop settles to either '1' or '0'. This phenomenon is known as metastability. To avoid metastability, the setup and hold time requirements must be met.

The input setup time reduces the time available for logic and therefore affects long paths. Long paths must be identified during RTL design, and care must be taken to prevent new long paths from appearing after place and route. The condition to be satisfied to avoid a long path is:

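$$\text{Cycle time } (CT) \ge \text{Logic delay } (L_d) + t_{C2Q} + 2\,\text{skew} + \text{Input setup time } (I.S.)$$

where t_C2Q is the clock-to-Q delay of the launching flip-flop and skew is the clock skew between the launching and capturing flip-flops.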

The input hold time affects only races. The longer the hold time, the larger Logic delay (Ld) + tC2Q must be on the path. Races are the most important timing problem to fix: a race fails at every clock frequency, so the affected part of the design is dead and cannot be used for debugging or evaluation, and fixing it requires a silicon change. The condition to be satisfied to avoid a race is:

$$\text{Logic delay } (L_d) + t_{C2Q} - 2\,\text{skew} \ge \text{Input hold time } (I.H.)$$
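The two conditions above can be written as simple slack checks, where negative slack means a violation. The following Python sketch is illustrative only: the function names and the example values (in ns) are hypothetical, and skew enters as the same 2 × skew term the formulas use. Note that the hold check has no cycle-time argument, which is why a race fails at every frequency.

```python
def setup_slack(ct, ld, t_c2q, skew, t_setup):
    """Long-path slack: CT - (Ld + tC2Q + 2*skew + I.S). Negative = setup violation."""
    return ct - (ld + t_c2q + 2 * skew + t_setup)


def hold_slack(ld, t_c2q, skew, t_hold):
    """Race slack: (Ld + tC2Q - 2*skew) - I.H. Negative = hold violation.

    The cycle time does not appear, so slowing the clock cannot fix a race.
    """
    return (ld + t_c2q - 2 * skew) - t_hold


# Hypothetical path values in ns.
long_path = setup_slack(ct=10.0, ld=7.0, t_c2q=0.5, skew=0.3, t_setup=0.4)
short_path = hold_slack(ld=0.2, t_c2q=0.5, skew=0.3, t_hold=0.2)

print(f"setup slack = {long_path:.2f} ns")   # meets timing if >= 0
print(f"hold slack  = {short_path:.2f} ns")  # violates if < 0
```

With these values the long path has 1.5 ns of slack, while the short path violates hold by 0.1 ns, and no change of clock period would alter the second result.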

Several methods can be used to fix timing violations in a digital circuit. These methods are explained below:

Use a complex cell

A 3-level logic circuit built from separate gates can be replaced by a single complex cell, such as an AND-OR-Invert (AOI) gate. The AOI is a 2-level complex gate that performs the AND, OR, and inversion in one cell.

A complex cell is much faster than the equivalent multi-cell implementation because it is only one gate. There is no extra wire between cells, so none of the metal and input capacitance that such a wire would carry has to be driven, which reduces the propagation delay.

Use bigger gates

Replace a gate with a bigger version of the same gate, such as a 2x gate in place of a 1x gate. A bigger gate has less delay because the factor multiplying the load capacitance in the cell's delay equation is inversely proportional to the cell size: as the cell size increases, the delay of the cell decreases. Consider a circuit in which gate G1 drives gate G2, and G2 drives an external load.

G2 can be made faster by increasing its size from a 1x to a 2x NAND gate. Because the gate is twice as big, the load-dependent factor is halved and the external capacitance is driven faster, while the constant (intrinsic) delay stays the same. However, the input capacitance of G2 also doubles, so G1 has a bigger load to drive and slows down. In general, increasing a gate's size increases the load on the previous gate proportionately, slowing that gate down. Replacing a gate with a bigger one must therefore be done after careful consideration, because the circuit as a whole can become slower. Sometimes upsizing a gate works and sometimes it does not.
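The trade-off can be made concrete with a simple linear delay model, delay = intrinsic delay + k × load / size, in which a gate's input capacitance grows with its size. This is a hedged sketch: the model and every parameter value are hypothetical, chosen only to show that upsizing G2 helps when its external load is large and hurts when it is small.

```python
# Hypothetical technology parameters (illustrative units: ps and fF).
T_INTRINSIC = 10.0   # constant part of each gate's delay, independent of size
K_DRIVE = 5.0        # load-dependent factor for a 1x gate; divided by gate size
C_IN_PER_X = 2.0     # input capacitance of a gate per unit of size


def gate_delay(size, c_load):
    """Linear delay model: the load-dependent term shrinks as the gate grows."""
    return T_INTRINSIC + K_DRIVE * c_load / size


def path_delay(c_ext, g2_size, g1_size=1.0):
    """Delay of G1 -> G2 -> external load.

    G1's load is G2's input capacitance, which grows with G2's size.
    """
    d_g1 = gate_delay(g1_size, C_IN_PER_X * g2_size)
    d_g2 = gate_delay(g2_size, c_ext)
    return d_g1 + d_g2


for c_ext in (20.0, 2.0):
    before = path_delay(c_ext, g2_size=1.0)
    after = path_delay(c_ext, g2_size=2.0)
    print(f"C_ext={c_ext:>4} fF: 1x G2 -> {before} ps, 2x G2 -> {after} ps")
```

With the heavy 20 fF load, doubling G2 cuts the path from 130 ps to 90 ps; with a light 2 fF load it raises the path from 40 ps to 45 ps, because the delay added to G1 outweighs the saving in G2.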

Operating certain logic in parallel

The number of logic levels in a circuit can be reduced by rearranging and grouping the terms of a logic expression. Consider a simple operation that adds four inputs: A1 + B1 + C1 + D1. The synthesizer first creates an adder for A1 + B1, then a second adder that adds C1 to that sum, and finally a third adder that adds D1 to produce the output. This gives three levels of adder delay, which is not optimal. Parentheses can be used to reduce the number of logic levels: grouping the expression as (A1 + B1) + (C1 + D1) makes the synthesizer compute A1 + B1 and C1 + D1 in parallel and then add the two results. Two additions are now computed simultaneously rather than one at a time, which reduces three levels of adder delay to two.

Before using parentheses: Z = A1 + B1 + C1 + D1; results in 3 levels of logic

After using parentheses: Z = (A1 + B1) + (C1 + D1); results in 2 levels of logic

Parentheses can also help fix a race. A race needs more delay, so the part of the circuit causing the race should be buried as deep within the parentheses as logically makes sense, where it passes through more levels of logic. Conversely, for a long path, the part of the circuit causing the long path should be placed outside the parentheses, where it passes through as few levels as possible.
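The effect of grouping on logic depth can be checked with a short Python sketch. It builds the left-to-right chain implied by A1 + B1 + C1 + D1 and a balanced grouping, and reports how many adders each input passes through in the chain. It counts each + as one adder level and ignores differences inside the adders; the function names are hypothetical.

```python
def chained(terms):
    """Left-to-right grouping, as in Z = A1+B1+C1+D1: one adder per extra term."""
    expr, depth = terms[0], 0
    for term in terms[1:]:
        expr, depth = f"({expr} + {term})", depth + 1
    return expr, depth


def balanced(terms):
    """Balanced grouping, as in Z = (A1+B1) + (C1+D1): halves are added in parallel."""
    if len(terms) == 1:
        return terms[0], 0
    mid = (len(terms) + 1) // 2
    left, d_left = balanced(terms[:mid])
    right, d_right = balanced(terms[mid:])
    return f"({left} + {right})", max(d_left, d_right) + 1


def chain_levels_per_input(terms):
    """Adders each input passes through in the chained form.

    The first two terms enter the first adder; each later term enters one adder later.
    """
    n = len(terms)
    return {t: (n - 1 if i == 0 else n - i) for i, t in enumerate(terms)}


inputs = ["A1", "B1", "C1", "D1"]
print(chained(inputs))                 # ('(((A1 + B1) + C1) + D1)', 3)
print(balanced(inputs))                # ('((A1 + B1) + (C1 + D1))', 2)
print(chain_levels_per_input(inputs))  # {'A1': 3, 'B1': 3, 'C1': 2, 'D1': 1}
```

In the chained form, D1 sits outside all the parentheses and passes through a single adder, which suits a late-arriving long-path signal, while A1 and B1, buried deepest, pass through three adders, which adds delay to a racing signal.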

Assign common logic to a local or global variable

Consider a logic expression MN + PQ that is used in 10 places in the design. The problem is that the synthesizer builds this logic every time it is used, which puts a heavy load on the inputs M, N, P, and Q.

To avoid this, assign the expression to a local or global variable; for the example above, result = MN + PQ. Every place that uses result then shares a single copy of the logic instead of building new gates for the expression. Using intermediate variables results in fewer gates and therefore shorter wires, which gives a faster chip.
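The saving can be counted with a small Python sketch. It assumes, as the text does, that the synthesizer builds a separate copy of MN + PQ for every use, and that each copy is two AND gates plus one OR gate; the function name and the gate count per copy are illustrative assumptions, and real tools may share some logic on their own.

```python
GATES_PER_COPY = 3  # assumed: AND(M, N), AND(P, Q), and an OR combining them


def logic_cost(uses, shared):
    """Gate count and load on each of M, N, P, Q for `uses` uses of MN + PQ.

    Without sharing, every use gets its own copy of the gates, so each input
    drives one gate input per use. With a shared variable there is one copy.
    """
    copies = 1 if shared else uses
    return {"gates": GATES_PER_COPY * copies, "loads_per_input": copies}


print("inline expression:", logic_cost(10, shared=False))  # 30 gates, 10 loads each
print("shared variable:  ", logic_cost(10, shared=True))   # 3 gates, 1 load each
```

With the shared variable, M, N, P, and Q each drive a single gate input; the one result net fans out to all ten uses instead.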

Using ternary operators

Long paths can be reduced by using ternary operators in the design code. For example, consider the expression ABC + DEFGHIJKQ (juxtaposition denotes AND and + denotes OR), where Q lies on a long path. The long path through Q can be avoided by writing the expression with the Verilog conditional operator as

Q ? (ABC + DEFGHIJK) : ABC

This means that when Q is true the result is ABC + DEFGHIJK, and when Q is false the result is ABC. The synthesizer builds the ternary operator as a MUX with Q as its select line, so Q passes through only the MUX instead of the full AND-OR logic, which can notably increase the speed of the circuit.
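The rewrite is safe only if it is logically equivalent to the original expression. A short Python sketch can confirm this by checking all 2^12 input combinations; it models the Boolean expressions directly rather than simulating Verilog, and the function names are hypothetical.

```python
from itertools import product

NAMES = "ABCDEFGHIJKQ"


def original(v):
    """ABC + DEFGHIJKQ: Q sits deep inside the second product term."""
    abc = v["A"] and v["B"] and v["C"]
    return abc or (all(v[n] for n in "DEFGHIJK") and v["Q"])


def muxed(v):
    """Q ? (ABC + DEFGHIJK) : ABC: Q only drives the MUX select."""
    abc = v["A"] and v["B"] and v["C"]
    defghijk = all(v[n] for n in "DEFGHIJK")
    return (abc or defghijk) if v["Q"] else abc


def equivalent():
    """Exhaustively compare both forms over every assignment of the 12 inputs."""
    for bits in product((False, True), repeat=len(NAMES)):
        v = dict(zip(NAMES, bits))
        if bool(original(v)) != bool(muxed(v)):
            return False
    return True


print(equivalent())  # True
```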

Lie about the clock cycle

Entering a tighter timing constraint than the real clock makes the synthesizer work harder to meet the timing requirements. To start off, restrict the tool to fast cells only, and then allow it to use bigger, more powerful gates to improve the design further; this gives incremental improvement. A reduced cycle time makes the synthesizer increase its optimization effort by putting faster cells in the design. Because the wire capacitance is not known accurately at synthesis and metal delay is estimated as a distribution around an average value, the extra margin lets the synthesizer account for variation in metal delay and prevents surprise long paths at the end of the design. The reduced cycle time also provides padding for the test and debug signals added to the design later, and for flip-flop changes made for test.
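As an illustration, the padded constraint can be derived from the real cycle time. This Python sketch applies the 80% rule of thumb listed among the rules near the end of this article and emits an SDC create_clock command; the clock and port name clk are hypothetical, and the exact factor is a project choice.

```python
def padded_clock_constraint(real_period_ns, factor=0.8, clock="clk", port="clk"):
    """Return an SDC create_clock line for a tighter-than-real clock period.

    The synthesizer is told the clock is faster than it really is, so paths are
    optimized with (1 - factor) * real_period_ns of margin held in reserve.
    """
    if not 0.0 < factor <= 1.0:
        raise ValueError("factor must be in (0, 1]")
    target = real_period_ns * factor
    return f"create_clock -name {clock} -period {target:.3f} [get_ports {port}]"


real_period = 10.0  # ns, hypothetical 100 MHz target
print(padded_clock_constraint(real_period))
# create_clock -name clk -period 8.000 [get_ports clk]
```

A path that just meets the 8 ns target still has 2 ns of margin at the real 10 ns clock to absorb metal-delay variation and later test and debug logic.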

Optimize the hierarchies

The synthesizer optimizes a design by combining multiple components into one, but it cannot combine components that sit in different hierarchies. It is therefore important to organize and prioritize the module hierarchy so that logic that should be optimized together is in the same block. Optimizing the hierarchy is especially useful in big, complex designs, because synthesizing them is extremely slow and can run out of memory and fill up the disk.

Temperature

External factors also play an important role in finding and fixing timing problems. Heat affects CMOS circuits in various ways, such as electromigration, Joule heating, and electron tunneling. As a CMOS circuit heats up, it slows down, so signal propagation time increases with temperature. Long paths are therefore found and fixed at high temperature, where the circuit is slowest, while races are found and fixed at low temperature, where it is fastest.

Voltage

Operating voltage is another external factor used to find and fix long paths and races. A circuit operates faster at higher voltage, although power dissipation is also higher. Signal propagation time therefore decreases as the operating voltage increases, so long paths are worst at low voltage and races are worst at high voltage.
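The temperature and voltage guidance amounts to choosing analysis corners. The following Python sketch uses a deliberately simple, hypothetical delay-scaling model in which delay rises with temperature and falls with voltage, as the two paragraphs above describe; the coefficient and corner values are illustrative only, not data for any real process.

```python
from itertools import product

TEMPS_C = (-40, 25, 125)     # hypothetical temperature corners
VOLTS = (0.9, 1.0, 1.1)      # hypothetical supply corners around a 1.0 V nominal
TEMP_COEFF = 0.002           # assumed fractional delay increase per degree C


def delay_scale(temp_c, volt, v_nominal=1.0):
    """Relative delay vs. 25 C / nominal voltage: slower when hot or at low voltage."""
    return (1 + TEMP_COEFF * (temp_c - 25)) * (v_nominal / volt)


corners = list(product(TEMPS_C, VOLTS))
# Long paths (setup) are checked where the circuit is slowest ...
setup_corner = max(corners, key=lambda c: delay_scale(*c))
# ... and races (hold) where it is fastest.
hold_corner = min(corners, key=lambda c: delay_scale(*c))

print("check long paths at", setup_corner)  # (125, 0.9)
print("check races at     ", hold_corner)   # (-40, 1.1)
```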

Retiming

Retiming fixes timing violations by repositioning the registers so that the combinational logic is divided into more evenly sized chunks, with each chunk working concurrently on different data. Retiming is a generalization of pipelining: instead of adding registers (delays), it moves the existing ones around the circuit. Retiming does not change the number of registers in any cycle (feedback loop) of the circuit. The main applications of retiming are:

Reducing the clock period
Reducing the number of registers
Reducing the power consumption
Logic synthesis

Changing register positions affects area, because the register count can change, and affects cycle time, because the path delays between registers change. Two important rules must be followed when retiming a circuit. First, if a register is removed from each input of a logic block, a register must be added to its output. Second, if a register is removed from the output of a logic block, a register must be added to each of its inputs. The drawback of retiming is that it does not help much on large, complex combinational logic, because merely redistributing the registers will not optimize the design.
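For the simple case of a feed-forward chain of logic blocks, retiming reduces to sliding the existing registers along the chain to balance the stage delays while keeping the register count fixed. The Python sketch below does this by brute force; the block delays are hypothetical, and general retiming algorithms operate on arbitrary circuit graphs with feedback loops rather than on a single chain.

```python
from itertools import combinations


def stage_delays(delays, cuts):
    """Combinational delay of each stage when a register sits after each block index in cuts."""
    stages, start = [], 0
    for cut in cuts:
        stages.append(sum(delays[start:cut + 1]))
        start = cut + 1
    stages.append(sum(delays[start:]))
    return stages


def critical_path(delays, cuts):
    """The slowest stage sets the minimum cycle time."""
    return max(stage_delays(delays, cuts))


def retime_chain(delays, num_regs):
    """Try every placement of num_regs internal registers and keep the fastest.

    The register count never changes, only the registers' positions.
    """
    best = min(combinations(range(len(delays) - 1), num_regs),
               key=lambda cuts: critical_path(delays, cuts))
    return best, critical_path(delays, best)


blocks = [5, 10, 15, 10, 20]            # hypothetical block delays in ns
print(critical_path(blocks, (0,)))      # 55: register right after the first block
print(retime_chain(blocks, 1))          # ((2,), 30): register moved after the third block
```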

Change the design

If the methods above do not meet the timing requirements, the last solution is to change the design.

Pipelining

Pipelining is the most effective way to improve the speed of a circuit. It breaks the logic into smaller sections and places flip-flops between them. Because the logic delay on the path between two flip-flops is reduced, there is less logic per clock cycle, which allows a higher clock frequency. Consider the logic (A2 + B2 + C2) * D2. The operations are an addition (A2 + B2), a second addition (A2 + B2) + C2, and finally a multiplication (A2 + B2 + C2) * D2. Assume that each operation takes 20 ns, that 200 sets of inputs are processed, and that flip-flop delay is ignored. Before pipelining, each result takes 60 ns, so the total time is 200 × 60 = 12000 ns. After pipelining into three stages, the first result appears after 60 ns and a new result follows every 20 ns, so the total time drops to (3 + 200 − 1) × 20 = 4040 ns. The steps to pipeline logic are:

Dividing the combinational logic into blocks
Mapping the dependencies between the blocks
Regrouping the blocks that are independent of each other
Organizing the blocks in sequential order
Adding flip-flops between the blocks to act as temporary storage elements
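The arithmetic in the pipelining example can be reproduced with a short Python sketch. Like the example, it assumes 200 input sets, three 20 ns operations, one operation per pipeline stage, and zero flip-flop delay; the function names are hypothetical.

```python
def unpipelined_time(n_items, op_delays):
    """Each item passes through all operations before the next item starts."""
    return n_items * sum(op_delays)


def pipelined_time(n_items, op_delays):
    """One operation per stage: fill the pipeline, then finish one item per cycle.

    The cycle time is set by the slowest stage; flip-flop delay is ignored.
    """
    cycle = max(op_delays)
    return (len(op_delays) + n_items - 1) * cycle


ops = [20, 20, 20]  # A2+B2, (A2+B2)+C2, (A2+B2+C2)*D2, in ns
print(unpipelined_time(200, ops))  # 12000
print(pipelined_time(200, ops))    # 4040
```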

Pipelining affects performance, because the performance of a logic circuit depends on the minimum cycle time at which the design still operates correctly. Performance is the reciprocal of that cycle time, that is, the maximum frequency of operation of the circuit.

Performance = 1 / Cycle time (CT). Each additional pipeline stage shortens the cycle time and therefore increases performance.

The advantages of pipelining are:

Increases throughput of the design
Reduces cycle time

The disadvantages of pipelining are:

Delay through the flip-flops increases the latency
If the flip-flops are not timed correctly, a race condition can be introduced in the design
The output becomes obsolete if the data is not pushed through the flip-flops
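The trade-off between these advantages and disadvantages can be sketched numerically. The Python below assumes 60 ns of logic split evenly across k stages and a hypothetical 2 ns of flip-flop overhead per stage (clock-to-Q plus setup); both numbers are illustrative. Performance, 1/CT, rises with every added stage, while latency grows because each stage adds flip-flop delay.

```python
LOGIC_NS = 60.0       # total combinational delay to be pipelined (hypothetical)
FF_OVERHEAD_NS = 2.0  # assumed clock-to-Q + setup cost added by each stage


def cycle_time(stages):
    """Ideal even split of the logic, plus one flip-flop overhead per stage."""
    return LOGIC_NS / stages + FF_OVERHEAD_NS


for k in range(1, 5):
    ct = cycle_time(k)
    performance_mhz = 1e3 / ct   # 1 / CT, with CT in ns, expressed in MHz
    latency = k * ct             # time for one result to pass through all stages
    print(f"{k} stage(s): CT={ct:5.1f} ns, perf={performance_mhz:5.1f} MHz, "
          f"latency={latency:5.1f} ns")
```

With these assumptions, going from one to four stages cuts the cycle time from 62 ns to 17 ns, while the latency grows from 62 ns to 68 ns.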

These pipelining drawbacks can be remedied with 2-flag models, of which there are two types:

2-Flag Push model – It has two flags: a push flag from the source indicating that data is being sent, and a stop flag from the destination signaling that the destination cannot take the data. The source is in control, and the assumption is that the destination can always take data unless it raises the stop flag. 2-flag push models tend to empty the pipeline. The main drawback of this model is that fixing the long path in the forward (push) path creates a long path in the reverse (stop) path, because the stop signal ripples down the pipeline.
2-Flag Pull model – It has two flags: a pull flag from the destination indicating that it is ready to take data, and a stop flag from the source signaling that it has no data to give. The destination is in control, and the assumption is that data is always present at the source unless the stop flag is raised. 2-flag pull models tend to fill the pipeline. Here, the pull signal ripples up the pipeline.
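The 2-flag push handshake can be illustrated with a cycle-based Python model: the source raises push whenever it has data, the destination raises stop when its buffer is full, and a word moves only in a cycle where push is high and stop is low. This is a hypothetical behavioural sketch (the buffer depth and drain rate are made-up parameters), not RTL; a pull interface mirrors it, with the destination driving pull and the source driving stop.

```python
from collections import deque


def simulate_push(items, buffer_depth, drain_every, max_cycles=10_000):
    """Cycle-based model of a 2-flag push interface.

    Returns the data received in order, the number of cycles the source was
    held off by stop, and the total cycle count.
    """
    source = deque(items)
    buffer = deque()
    received, stalls, cycle = [], 0, 0
    while (source or buffer) and cycle < max_cycles:
        stop = len(buffer) >= buffer_depth   # destination: "cannot take data"
        push = bool(source)                  # source: "sending data"
        if push and not stop:
            buffer.append(source.popleft())  # transfer happens this cycle
        elif push:
            stalls += 1                      # data held at the source by stop
        if buffer and cycle % drain_every == 0:
            received.append(buffer.popleft())  # slow consumer frees a slot
        cycle += 1
    return received, stalls, cycle


data = list(range(10))
received, stalls, cycles = simulate_push(data, buffer_depth=2, drain_every=3)
print(received == data, stalls, cycles)
```

Because nothing moves while stop is high, no word is lost or duplicated even though the consumer is three times slower than the source; the cost is the stall cycles counted above.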

The flagged pipeline model is a very effective way to speed up a design while removing timing problems.

Flags in Block Interfaces

The blocks inside a chip do not all work at the same clock frequency, and when logic crosses from one clock domain to another, we need to ensure that the signals are synchronized with the clock of the new block. Flags are used to synchronize these logic blocks. Flags control the movement of data: they indicate whether a block is ready to receive data, whether the data on the bus is valid, and whether the data has been taken. Without flags, a pipelining change in one block can ripple into other blocks, and debugging becomes difficult because there is no way to know when the output data is valid. A change in one block's cycle count would otherwise affect the other modules in the design; flags avoid such cycle-count changes when one block's pipeline length changes, isolating the change to one part of the design.

Flags are added to block interfaces to allow timing changes without impacting the other modules in the design. Flags enable block reuse and indicate when data is present in a block. They also avoid design-change problems in the latter half of the project by preventing cycle-count changes from propagating through the chip as timing problems are resolved. Flags play a major role in simplifying debug.

The following rules help ensure a smooth design flow:

To make STA easier:

Use the same clock edge as the rest of the design
Use synchronous design
Use flip-flops only
Use the same types of flip-flops so that setup and hold time requirements are consistent

To meet timing constraints:

Synthesize early and often, to find and fix problems while there is still time left in the schedule
Synthesize to 80% of the cycle time

For reuse:

Use standard interfaces such as flag models

Fixing the timing problems found by Static Timing Analysis is of utmost importance. The methods described above can be used to remove timing violations from a design and make sure the circuit operates correctly. If these critical issues are not addressed, they lead to design flaws that can result in huge losses for chip-making companies.
