컴퓨터구조 - Pipelined processor

chance·2020년 6월 5일

multi cycle pipelining single cycle 컴퓨터구조

컴퓨터구조

목록 보기

4/6

학교에서 진행되는 컴퓨터구조 수업을 듣고 중요한 부분 위주로 정리하였습니다. 내용 상에 오류가 있다면 댓글로 피드백 부탁드립니다!

프로세서는 CPU를 의미한다.

싱글사이클

한 명령어(instruction) 당 하나의 사이클 안에서 수행된다. 클럭 사이클 타임은 가장 시간이 오래 걸리는 명령어의 처리 시간보다 더 길어야한다.
명령어 중에서 lw가 functional unit을 가장 많이 사용하여 처리 시간이 길다. 다른 처리 시간이 짧게 걸리는 r type과 같은 명령어들조차 lw보다 긴 사이클 타임 동안 기다려야하니 비효율적이다.
한 장치(unit)이 연산을 수행하고 있을 때 다른 유닛들은 사용하고 있지 않아 여러개의 명령을 처리해야할 경우 시간이 오래 걸린다.

멀티사이클

명령어는 Ifetch(instrcution fetch), ID(Instruciton Decode and Registers Fetch), Exec(Calculate the memory address), Mem(Read the data from the Data Memory), Wr(Write the data back to the register file)의 과정으로 만들 수 있다.
이 과정 중에서 lw를 제외하고는 전부가 아닌 일부 과정만들 거친다. 예를 들어 store는 Ifetch -> ID -> Exec -> Mem까지만 수행하면 끝난다.
따라서 한 명령어마다 배정한 사이클을 각각의 operation 단계로 정하면 필요없는 연산 과정을 생략해 성능 향상을 기대할 수 있다.

파이프라이닝

한 functional unit을 사용하는 동안 다른 unit은 아무런 일도 하고있지 않다.
unit이 명령어를 처리했으면 놀고있지 말고 다음 명령어를 받아와서 처리하는 방식은 어떨까? 라는 생각에서 출발하였다.
한 명령어를 처리하는데 걸리는 시간은 그대로이지만 throughput(유닛이 일정한 시간에 처리하는 일의 양)을 늘려 더 빨라보이게 만든다. 실제로 더 빨라진다.

추가해야할 내용들

오류 정정
Load word instruction data/control flow
Load store: memread signal이 1이어야 한다.
If all stages are balanced
I.e., All take the sanme time
Time between instructions(pipelined)/Number of stages
CTp(cycle time pipelined) = CTnp(cycle time num pipelined)/k(stage)
CTnp = k * CTp

K stages를 pipelined로 처리한다고 해보자.
N개의 instruction
To fill the pipeline
Cpi가 1처럼 보이게 함
Tp = [(k-1) + n] CTp
Tnp = N CTnp
Speedup = Tnp/Tp = N CTnp / ([(k-1) + n] Ctp)
= n k Ctp / ([(k-1) + n] Ctp)
= n k / [(k-1) + n]
= k / [(k-1) + n]/n
은 k에 근사

Single cycle

The processor part2 datapath

Pipeline registers
To hold information produced in previous cycle

Pipeline operation
Cycle-by-cycle flow of instructions through the pipelined datapath
Single-clock-cycle pipeline diagram
Multi-clock-cycle diagram
load/store
Wrong register number
Write 레지스터 번지를 임시값이 저장된 레지스터에서 가져옴.
Register write number를 끌어와야함.
Register를 읽는 이벤트와 쓰는 이벤트가 다른 곳에서 발생.
wb에서 뽑아서 끌고 오는 시점까지 가지고 있어야함. 왜??
레지스터 값을 바로 읽어오는게 아니라 이벤트가 있는 시점까지 기다렸다가 읽어야함

Ex for store

Single-cycle pipeline diagram
State of pipeline in a given cycle

Multi-Cycle Pipeline Diagram
Form showing resource usage

Multi-Cycle pipeline diagram

Can help with answering questions like:
How many cycles does it take to execute this code
What is the ALU doing during cycle 4?

Why Pipeline? For Performance!
Once the pipeline is full, one instruction is complete every cycle, so CPI = 1

Mips pipeline datapath additions/mods
State registers between each pipeline stage to isolate them
System clock으로 동기화되어 저장한다. Flip-flop, edge-triggering으로 동작.

Pipeline control(simplified)
What needs to be controlled in each stage?
명령어 type, stage마다 매번 다른 control signal 조합이 나온다.
How would control be handled in an automobile plant?
A fancy control center telling everyone what to do?
Should we use a finite state machine?

Pipelined control
Control signals derived from instruction
각각의 stage별로 내가 필요한 resource가 정해짐.
Stage별로 필요한 control signal이 다름.
원하는 시점에 register에 control signal을 적용한다.

Mips pipeline data and control paths

The Processor

Pipeline hazards
Structural hazards:
Attempt to use the same resource two different ways at the same time
pipeline의 두 단계가 같은 리소스를 동시에 사용한다.
Data hazards:
Attempt to use item before it is ready
Instruction depends on result of prior instruction still in the pipeline

Add r1, r2, r3
Sub r4, r2, r1
Control hazards
Attempt to make a decision before condition is evaluated
Branch instructions
beq r1, r4, loop
add r1, r2, r3
Can always resolve hazards by waiting
Pipeline control must detect the hazard
Take action (or delay action) to resolve hazards

Structure Hazards
Conflict for use of a resource
In MIPS pipeline with a single memory
Load/store requires data access
Instruction fetch would have to stall for that cycle
Would cause a pipeline “bubble”
Hence, pipelined datapaths require separate instruction/data memories
Or separate instruction/data caches

Structural Hazards limit performance
If 1.3 memory accesses per instruction and only one memory access per cycle then
30% load/store로 추론 가능
한번은 instruction memory를 가져옴 -> 100% memory access
부가적인 것 => load/store의 memory access
Average CPI = 1.3
Solution 1: use separate instruction and data memories(포트를 추가하는 것도 포함)
Solution 2: allow memory to read and write more than one word per cycle
Solution 3: stall(stop running)

Data Hazard
$2 cannot be read by other instructions before it is written by the add.
Sub $2, $1, $3
add $12, $2, $5
or $13, $6, $2
Add $14, $2, $2
Sw $15, 100($2)

Dependencies
5번째에서 register update
나머지 instruction들이 그 전에 register에 access 시도.
5번째 사이클의 실행결과에 의존적.
Dependencies that go backward in time are data hazards

Software Solution
Have compiler guarantee no hazards
Insert the stalls(nops)
Problem: this really slows us down!

Stall: One Way to Fix a Data Hazard
sub $2, $1, $3
and $12, $2, $5 => 1사이클 멈춰야 함
Snd $12, $2, $5 =>
동시에 read와 write 시도시에는?
엣지가 뛴 다음 순간에 값을 업데이트
엣지가 뛰기 전에 읽음 => 옛날 값을 읽음
Can fix data hazard by waiting
Stall
But affects throughput

해결책: 내부적으로 clock edge triggering해야 update가 되기 때문에 내부적으로 bypass할 수 있도록 만든다.

How about register file access?
Internal bypassing path
Fix register file access hazard by doing reads in the second half of the cycle and wites in the first half
같은 read/write port면 전달해줄 수 있도록 구성
1 사이클을 줄일 수 있음
Writing 하는 결과를 입력으로 전달해주는 내부 path를 만든다.

Another way to “fix” a data hazard
Use temporary results, don’t wait for them to be written
Register file forwarding to handle read/write to same register
Requires extra connections in the datapath
Stage(temporal) register에 이미 결과가 만들어져 있다.
Register file에 기록이 안되어져 있을뿐
Temporal register에 있는 값들을 forwarding할 수 있는 path를 만든다.

Forwarding
Can fix data hazard by forwarding results as soon as they are available to where they are needed.

Pipeline hazards

Data forwarding(aka bypassing)
Any data dependence line that goes backwards in time
Ex stage generating r-type alu results or effective address calculation
Mem stage generating lw results
Forward by taking the inputs to the alu from any pipeline register rather than just id/ex by
Adding multiplexors to the inputs of the alu
id/ex pipeline registers
ex/mem pipeline registers
mem/wb pipeline registers
Can run at full speed
ForwardA, forwardB라는 multiplexor unit 추가
Alu result 또는 memory write back의 result를 forwarding

Data forwarding control conditions
EX/MEM hazard
Forward가 alu result를 forward해주는 경우
if(EX/MEM Register Rd == ID/EX.RegisterRs))
forwardA = 10
if (EX/MEM.RegisterRd = ID/EX.RegisterRt)
forwardB = 10
MEM/WB hazard:
If (MEM/WB.RegWrite)
And (MEM/WB.RegisterRd == ID/Ex.RegisterRs)
ForwardA = 01
If (MEM/WB.RegWrite)
And (MEM/WB.RegisterRd == ID/Ex.RegisterRt)
forwardB = 01
???
Register 0번은 true zero를 의미하여 안된다.
Yet another complication!
Another potential data hazard can occur
어디서 forward할 지에 대한 decision making이 정해져있지 않다. (그 전인지, 그 전전인지)
더 최신의 결과를 가져와야함 -> 바로 전의 것으로 가져옴

Memory-to-memory copies
Lw -> sw
For loads immediately followed by stores(mem-to-mem copies) can avoid a stall by adding forwarding hardware from the mem/wb register to the data memory input.
Would need to add a Forward Unit to the memory access stage
Should avoid stalling on such a lod

Forwarding(or bypassing): what about loads
Lw $2, 20($1)
And $4, $2, $5
Can’t always avoid stalls by forwarding
If value not computed when needed
Can’t forward backward in time!
Must delay/stall instruction dependent on loads

Can’t always forward
Load word can still cause a hazard, an instruction tries to read a register following a load instruction that writes to the same register.

Stalling
Software: No-op를 집어넣는 경우도 있음
Hardware detection -> called stall
mem/wb에 결과 써짐 -> exe/mid/exe 과정에서 강제적으로 exe stage로 넘어가는 과정을 멈춤
나머지도 모두 한 사이클 다 밀림.

Stall/bubble in the pipeline
Pipeline 사이에 빈 instruction공간: bubble(no-op)
Pipeline: instruction을 처리해주는 path(route)

Load-use hazard detection unit
If (ID/EX MemRead => Mem address calculation, lw
And ((ID/EX.RegisterRt = IF/ID.RegisterRs)
Or (ID/Ex.RegisterRt = IF/ID.RegisterRt)))
Stall the pipeline

The first line tests to see if the instruction is a load
The next two lines check to see if the destination register of the load in the ex stage matches either source registers of the instruction in the ID stage

Stall hardware
Prevent the if and id stage instructions from making progress down the pipeline,
Done by preventing the PC register and the IF/ID pipeline register from changing
Hazard detection unit controls the writing of the PC and IF/ID registers
The instructions in the back half of the pipeline starting with the EX stage must be flushed(execute noop)
Must deassert the control signals(setting them to 0) in the EX, MEM, and WB control fields of the ID/EX pipeline register.
Hazard detection unit controls the multiplexor that chooses between the real control values and 0’s
Assumes that 0’s are benign values in datapath: nothing changes

Code scheduling to avoid stalls
9 + 4 cycles
Lw를 옮겨가서 forward가 가능해짐.
모든 명령어가 stall 없이 실행 가능해짐.
7 + 4 -> 11사이클

Pipeline에 의해 instruction code가 같더라도 컴파일러의 코드 stalling 방법에 따라 성능 차이가 발생함.

chance

프론트엔드와 알고리즘을 주로 다룹니다.

이전 포스트

컴퓨터구조 - datapath🧭

다음 포스트

컴퓨터구조 - Pipelined processor