Data hazards occur when the memory address used by an instruction depends on the result of the immediately preceding instruction (EXE to ARD, 2 stall cycles) or of the instruction before that (1 stall cycle), or when an instruction writes a value to memory and one of the next instructions reads that value back from memory. All other data dependences can use forwarding to avoid stalls.
I3: sll r2,r2,2
I4: add r2,r2,r4     (I9 stalls, but we can do I2 from the next iteration instead.)
I6: add r3,r3,r2     (I12 stalls, and all subsequent instructions that remain have dependences, so this stall remains.)
I7: sw r3,-4(sp)
I1: addi r4,r4,4     (I4 stalls, but we can do I8 instead.)
I1: lw t1,-4(sp)     (I6 stalls, and all subsequent instructions have dependences.)
I4: add t4,t3,r4
I5: lw t5,0(t4)      (I9 stalls, but we can do I1 from the next iteration instead.)
I6: add r3,t2,t5     (I12 stalls, but we can do I2 from the next iteration instead.)
I1: addi r4,r4,4     (This loop can now execute without stalls. After I5 we execute I4, so I6 no longer stalls.)
I3: lw t2,4(r4)
I3: sll t3,t1,2      (Note that I8 reads what I7 wrote to memory, so these instructions are still dependent.)
I4: add t4,t3,r4
I5: lw t5,0(t4)      (I9 would stall, but we can do I1 from the next iteration instead.)
I6: add t6,t2,t5
I7: sw t6,-4(sp)     (I12 would stall, but we can do I2 from the next iteration instead.)
I8: lw t7,-4(sp)
I1: addi t1,t1,4     (No stalls remain. After I5 we execute I4, so I6 no longer stalls.)
I3: lw t3,4(t1)
I4: add t4,t3,t2     (In the next iteration, uses of r4 are renamed to t3.)

The number of instructions between mispredictions is one divided by the number of mispredictions per instruction. We get: [table: Mispredictions per Instruction, Instructions between Mispredictions]. The number of in-progress branches can then be easily computed because we know what percentage of all instructions are branches.
We have: [table: In-progress Branches]. If the branch outcome is known in stage N of the pipeline, all instructions in the N - 1 earlier stages are from the wrong path. In the Nth stage, all instructions after the branch are from the wrong path. Assuming that the branch is just as likely to be the 1st, 2nd, 3rd, or 4th instruction fetched in its cycle, we have on average 1.5 wrong-path instructions in that stage.
We have: [table: Wrong-path Instructions]. To compute the CPI, we note that we have already determined the number of useful instructions between branch mispredictions. From that we can determine the number of cycles between branch mispredictions, and then the CPI (cycles per useful instruction). This is because the 8-issue processor needs fewer cycles to execute the same number of instructions, so the same 1-cycle improvement represents a larger relative improvement (speedup).
For the 1-issue 5-stage processor we have a CPI of 1 and a clock cycle time of T. Overall, we get a speedup of: [table: Speedup]. Then we compute the number of cycles needed to execute these instructions if there were no misprediction stalls, and the number of stall cycles due to a misprediction. Note that the number of cycles spent on a misprediction is a number of entire cycles (one less than the stage in which branches are executed) plus a fraction of the cycle in which the mispredicted branch instruction is fetched.
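To make the arithmetic concrete, the following C sketch walks through the same calculation; the issue width, branch-resolution stage, and misprediction rate used here are made-up placeholders rather than the exercise's values.

  #include <stdio.h>

  int main(void) {
      /* Hypothetical parameters -- not the values used in the exercise. */
      double issue_width  = 4.0;    /* instructions fetched per cycle    */
      double branch_stage = 7.0;    /* stage N where branches resolve    */
      double mispred_rate = 0.01;   /* mispredictions per instruction    */

      /* Useful instructions between mispredictions = 1 / rate.          */
      double useful_insts  = 1.0 / mispred_rate;
      /* Cycles spent fetching those useful instructions.                */
      double useful_cycles = useful_insts / issue_width;

      /* Misprediction penalty: N-1 whole cycles of wrong-path work plus
         a fraction of the fetch cycle holding the branch itself; if the
         branch is equally likely to be the 1st..Wth instruction fetched,
         that fraction averages (W-1)/(2*W).                             */
      double penalty = (branch_stage - 1.0)
                     + (issue_width - 1.0) / (2.0 * issue_width);

      double cpi     = (useful_cycles + penalty) / useful_insts;
      double speedup = 1.0 / cpi;   /* vs. a 1-issue, CPI = 1 machine with
                                       the same clock cycle time         */
      printf("CPI = %.3f, speedup over 1-issue = %.2f\n", cpi, speedup);
      return 0;
  }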
The fraction of a cycle is determined by averaging over all possibilities. Since we will only execute one kind of instruction, we do not need to decode the instruction, but we still need to read registers. As a result, we will still need an ID pipeline stage (although it would be misnamed). After that, we have an EXE stage, but this stage is simpler because we know exactly which operation should be executed, so there is no need for an ALU that supports different operations.
Also, we need no Mux to select which values to use in the operation because we know exactly which value it will be. In the ID stage we read two registers and we do not need a sign-extend unit. Note that there is no MEM stage, so this is a 4-stage pipeline. Also note that the PC is always incremented by 4, so we do not need the other Add and Mux units that compute the new PC for branches and jumps. We read two registers in the ID stage, and we also need the sign-extend unit for the Offs field in the instruction word.
After the EXE stage we use the output of the Add unit as a memory address in the MEM stage, and the value we read from Rt is used as a data value for a memory write.
Note that there is no WB stage, so this is a 4-stage pipeline. No hazard detection unit is needed because forwarding eliminates all hazards. There is no need for forwarding or hazard detection in this pipeline because there are no RAW data dependences between two store instructions. The decoding logic must simply check the opcode and the funct field (if there is a funct field). The two operations are identical until the end of the EXE stage.
In fact, the work of the WB stage can be done in the MEM stage, so our pipeline remains a 4-stage pipeline. Now we need forwarding because of ADDI instructions. Fortunately, we still need no hazard detection. After that, the exception handling is the same as in the previous exercise.
We have: [table: Delay Slots Needed]. These cycles are reduced by filling them with delay-slot instructions. Overall, an average branch instruction is now accompanied by fewer than one NOP. Note that these NOPs are added for every branch, not just mispredicted ones. In the cycle in which we are placing the last micro-op of an instruction, we can begin fetching and translating the next instruction. Note that this results in executing up to one micro-op per cycle, but we are actually fetching instructions less often than that. We need to add an incrementer in the MEM stage.
This incrementer would increment the value read from Rs while memory is being accessed. We also need to write this incremented value back into Rs. We can use the existing EX stage to perform this address calculation and then write to memory in the MEM stage. But we do need an additional third register read port because this instruction reads three registers in the ID stage, and we need to pass these three values to the EX stage.
Note that these changes slow down all the other instructions, so we are speeding up a relatively small fraction of the execution while slowing down everything else. As a result, each cycle in which we executed an ADDM instruction now adds three more cycles to the execution. For the least utilized unit, we have: a. The read port of the data memory is never used (there are no load instructions).
We also need 5 bits for each of the three register fields from the instruction word (Rs, Rt, Rd), and 10 bits for all the control signals output by the Control unit. We also need 5 bits for the number of the destination register and 4 bits for control signals.
The grand total for all pipeline registers is the sum of these bit counts. In the ID stage, the critical path is the latency to read Regs. For a single-cycle design, the clock cycle time is the sum of these per-stage latencies for a load instruction. For a pipelined design, the clock cycle time is the longest of the per-stage latencies. To compare these clock cycle times, we compute a speedup based on clock cycle time alone, assuming the number of clock cycles is the same in the single-cycle and pipelined designs (which is not true).
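As a small worked sketch of this comparison, the C fragment below takes made-up per-stage latencies (not the exercise's values) and derives both clock cycle times and the resulting clock-rate-only speedup.

  #include <stdio.h>

  int main(void) {
      /* Hypothetical IF, ID, EX, MEM, WB latencies in picoseconds. */
      double stage_ps[5] = {250.0, 350.0, 150.0, 300.0, 200.0};

      double single_cycle = 0.0;   /* sum of the per-stage latencies   */
      double pipelined    = 0.0;   /* longest single per-stage latency */
      for (int i = 0; i < 5; i++) {
          single_cycle += stage_ps[i];
          if (stage_ps[i] > pipelined)
              pipelined = stage_ps[i];
      }

      /* Speedup based on clock cycle time alone (this assumes the same
         cycle count in both designs, which the text notes is not true). */
      printf("single-cycle = %.0f ps, pipelined = %.0f ps, speedup = %.2f\n",
             single_cycle, pipelined, single_cycle / pipelined);
      return 0;
  }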
We still need Muxes before the ALU for forwarding. For a pipelined design, we must reduce latencies of all stages that have longer latencies than the target latency.
Every instruction also results in two register reads, even if only one of those values is actually used. A load instruction results in a memory read and a register write, a store instruction results in a memory write, and all other instructions result in either a register write or no register write at all (e.g., branches). Because the sum of memory read and register write energy is larger than memory write energy, the worst-case instruction is a load instruction.
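A minimal C sketch of this energy bookkeeping is shown below; the per-access energies are hypothetical placeholders, not the values given in the exercise.

  #include <stdio.h>

  int main(void) {
      /* Hypothetical per-access energies in picojoules. */
      double e_imem_read  = 140.0;
      double e_reg_read   =  70.0;
      double e_reg_write  =  60.0;
      double e_dmem_read  = 140.0;
      double e_dmem_write = 120.0;

      /* Every instruction reads instruction memory and, in the base
         design, two registers.  A load additionally reads data memory
         and writes a register, which is why it is the worst case.      */
      double e_load  = e_imem_read + 2.0 * e_reg_read
                     + e_dmem_read + e_reg_write;
      double e_store = e_imem_read + 2.0 * e_reg_read + e_dmem_write;

      printf("load = %.0f pJ, store = %.0f pJ\n", e_load, e_store);
      return 0;
  }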
For the energy spent by a load, we have: [table]. However, we can avoid reading registers whose values are not going to be used. To do this, we must determine which registers an instruction actually needs before its register reads take place, which means the relevant control signals must be available early. We must generate these control signals quickly to avoid lengthening the clock cycle time.
After the change, the latencies of Control and Register Read cannot be overlapped. This change does not affect clock cycle time because the clock cycle time must already allow enough time for memory to be read in the MEM stage. It does affect energy: a memory read occurs in every cycle instead of only in cycles when a load instruction is in the MEM stage.
The energy per cycle is the total of the energy expenditures in all five stages. For each stage, we can compute its factor X by dividing the new latency (clock cycle time) by the original latency. We then compute the new per-cycle energy consumption for each stage by dividing its energy by its factor X. After that, this problem is solved in the same way as the previous exercise. a. B[I][0]; A[I][J]. b. A[J][I]. No solution provided for b.
Solution 5: P1. Because the disk bandwidth grows much faster than seek latency, future paging cost will be closer to constant, thus favoring larger pages. When most missed TLB entries are cached in processor caches. Unfortunately, a cache controller cannot know the future! Our best alternative is to make a good prediction. On the other hand, you could worsen the miss rate by choosing poorly which addresses to cache.
Nested page table: (1) the VM creates a new page table and the hypervisor adds new mappings to its PA-to-MA table; (2) the hardware walks both page tables to translate VA to MA; (3) the VM and hypervisor update their own page tables, and the hypervisor invalidates stale TLB entries; (4) same as for the shadow page table. Virtual machines aim to provide each operating system with the illusion of having the entire machine at its disposal. Thus they both serve very similar goals, and offer benefits such as increased security.
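The nested walk described above can be sketched in a few lines of C. The page size, table sizes, and mappings below are illustrative assumptions, not anything specified by the exercise.

  #include <stdint.h>
  #include <stdio.h>

  #define PAGE_BITS 12                 /* assume 4 KiB pages           */
  #define NPAGES    16                 /* tiny illustrative tables     */

  /* guest_pt:      VA page -> PA page (maintained by the guest OS/VM) */
  /* hypervisor_pt: PA page -> MA page (maintained by the hypervisor)  */
  static uint64_t guest_pt[NPAGES]      = { [0] = 3, [1] = 7, [2] = 5 };
  static uint64_t hypervisor_pt[NPAGES] = { [3] = 9, [5] = 2, [7] = 4 };

  /* A nested walk translates VA -> PA with the guest table and then
     PA -> MA with the hypervisor table; a shadow page table instead
     stores the combined VA -> MA mapping so only one lookup is needed. */
  static uint64_t translate(uint64_t va) {
      uint64_t offset  = va & ((1u << PAGE_BITS) - 1);
      uint64_t pa_page = guest_pt[va >> PAGE_BITS];
      uint64_t ma_page = hypervisor_pt[pa_page];
      return (ma_page << PAGE_BITS) | offset;
  }

  int main(void) {
      printf("MA = 0x%llx\n", (unsigned long long)translate(0x1abc));
      return 0;
  }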
Virtual memory allows many applications to run in the same physical memory without each having to manage keeping its memory separate from the others. Each ISA has specific behaviors that happen upon instruction execution, interrupts, trapping to kernel mode, etc. Emulating these can require many more instructions to be executed per emulated instruction than were originally necessary in the target ISA. This can cause a large performance impact and make it difficult to properly communicate with external devices.
An emulated system can potentially run faster than on its native ISA if the emulated code can be dynamically examined and optimized. This is similar to recent Intel processors that perform micro-op fusion, allowing several operations to be handled as fewer micro-ops.
If the cache is not able to satisfy hits while writing back from the write buffer, the cache will perform little or no better than the cache without the write buffer, since requests will still be serialized behind writebacks. Once the memory channel is free, the cache is able to issue the read request to satisfy the miss.
The memory read should come before memory writes. Group the srcIP and refTime fields into a separate array. Split the srcIP into a separate array; have a hash table on the browser field.
Example cache: 4-block caches, direct-mapped vs. [alternative organization]; reference stream (blocks): 1, 2, 2, 6, 1. [Table: Auto Pilot Keypad and Automated Thermostat Keypad utilization values.] With the emergence of inexpensive drives, having a nearly 0 replacement time for hardware is quite feasible. However, replacing file systems and other data can take significant time. Although a drive manufacturer will not include this time in their statistics, it is certainly a part of replacing a disk.
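The replacement-time point above is really an argument about mean time to repair (MTTR); a minimal sketch of how MTTR and MTTF combine into availability, using made-up numbers, is:

  #include <stdio.h>

  int main(void) {
      /* Hypothetical values: MTTF from a drive datasheet, MTTR including
         the time to rebuild file systems and restore data (hours).     */
      double mttf = 1000000.0;
      double mttr = 24.0;

      /* Availability = MTTF / (MTTF + MTTR). */
      printf("availability = %.6f\n", mttf / (mttf + mttr));
      return 0;
  }

The longer repair really takes (once data restoration is counted), the lower availability falls, even if the drive's quoted MTTF is unchanged.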
However, availability would be quite high if MTTF also grew measurably. Solution 6: Interestingly, by doubling the block size, the RW time changes very little. Thus, block size does not seem to be critical. No: An aircraft control system will process frequent requests for small amounts of information.
Increasing the sector size will decrease the rate at which requests can be processed. No: A phone switch processes frequent requests for small data elements.
Increasing sector size will potentially reduce performance. Faster access to disk may be useful in some situations, but not in normal operation. Faster access to disk may be useful, but may improve performance only in limited scenarios. No: Failure in an aircraft control system is not tolerable. Increasing disk failure rate for faster data access is not acceptable. No: Failure in a phone switch is not tolerable. In effect, if data transfer time remains constant, performance should increase.
What is interesting is that disk data transfer rates have always outpaced improvements with disk alternatives. FLASH is the first technology with potential to catch hard disk.
No: Increased drive performance is not an issue in an aircraft controller. No: Increased drive performance is not an issue in a phone switch. No. Solution 6: The printer is electrically distant from the CPU. Scanner inputs are relatively infrequent in comparison to other inputs. The scanner itself is electrically distant from the CPU. Specifically, long synchronous busses typically use parallel cables that are subject to noise and clock skew.
The longer a parallel bus is, the more susceptible it is to environmental noise. Balanced cables can prevent some of these issues, but not without significant expense. Clock skew is also a problem with the clock at the end of a long bus being delayed due to transmission distance or distorted due to noise and transmission issues. If a bus is electrically long, then an asynchronous bus is usually best.
Usually, asynchronous busses are serial. Thus, for large data sets, transmission overhead can be quite high. If a device is time sensitive, then an asynchronous bus may not be the right choice.
There are certainly exceptions to this rule of thumb such as FireWire, an asynchronous bus that has excellent timing properties. FireWire would not be as appropriate due to its daisy chaining implementation.
PCI due to higher throughput. No need for hot swap capabilities and the device will be close to the CPU. Individual devices do not have controllers, but send requests and receive commands from the bus controller through their control lines.
Although the data bus is shared among all devices, control lines belong to a single device on the bus. FireWire: Uses a daisy chain approach. A controller exists in each device that generates requests for the device and processes requests from devices after it on the bus. Devices relay requests from other devices along the daisy chain until they reach the main bus controller. Having a fixed number of control lines limits the number of devices on the bus.
The trade-off is speed. PCI busses are not useful for peripherals that are physically distant from the computer. USB: Serial communication implies longer communication distances, but the serial nature of the communication limits communication speed. USB busses are useful for peripherals with relatively low data rates that must be physically distant from the computer.
FireWire: Daisy chaining allows adding a theoretically unlimited number of devices. However, when one device in the daisy chain dies, all devices further along the chain cannot communicate with the controller. When the device requires attention or is available, the polling process communicates with it. The interface may be handled by polling, but not control or sensor inputs. Yes. While polling requires a process to periodically examine the state of a device, interrupts are raised by the device and occur when the device is ready to communicate.
When the CPU is ready to communicate with the device, the handler associated with the interrupt runs and then returns control to the main process. Aircraft surfaces generate interrupts caused by movements. Controller generates signals back to control surfaces. User displays can be managed by either polling or interrupts.
Polling is okay. It inputs 32 single word values from various sensors on control surfaces and generates 32 single word values as control signals to actuators. Status for 32 potential alarm values is stored in one word while four words store navigational information. An automated thermostat is a simple device, but it has both input and output functions.
The keypad memory should hold values input by toggle switches and numeric entries. Similarly, control surfaces can be controlled by issuing individual commands or issuing commands with state for several sensors. A graphics card is an excellent example. A memory map can be used to store information that is to be displayed. Then, a command can be used to actually display the information.
Similar techniques would work for other devices from the table. The status register is saved to assure that any lower priority interrupts that have been detected are handled when the status register is restored following handling of the current interrupt. Priorities: Mouse Controller: 3; Power Down: 2; Overheat: 1. Ethernet Controller Data Interrupt: Save the current program state. Jump to the Ethernet controller interrupt code and handle data input. Restore the program state and continue execution.
Overheat Interrupt: Jump to an emergency power-down sequence and begin execution. Mouse Controller Interrupt: Save the current program state. Jump to the mouse controller code and handle input. Reboot Interrupt: Jump to address 0 and reinitialize the system. Zeroing all bits in the mask would have the same effect. Specifically, when an interrupt is handled that does not terminate execution, the running program must return to the point where the interrupt occurred.
Handling this in the operating system is certainly feasible, but this solution requires storing information on the stack, in registers, in a dedicated memory area, or some combination of the three. Providing hardware support removes the burden of storing program state from the operating system. Specifically, program state information need not be pulled from the CPU and stored in memory. This is essentially the same as handling a function call, except that some interrupts do not allow the interrupted program to resume execution.
Like an interrupt, a function must store program state information before jumping to its code. There are sophisticated activation record management protocols and, frequently, supporting hardware for this on many CPUs.
Higher priority interrupts are handled first and lower priority interrupts are disabled when a higher priority interrupt is being handled. Even though each interrupt causes a jump to its own vector, the interrupt system implementation must still handle interrupt signals. Both approaches have roughly the same capabilities. The CPU initiates the data transfer, but once the data transfer starts, the device and memory communicate directly with no intervention from the CPU.
The dataflow back and forth from a mouse is insignificant. One thought is that the Ethernet controller handles significant amounts of data. However, that data is typically in relatively small packets. Depending on the functionality performed by the controller, it may or may not make sense to have it use DMA. A frame handled by a graphics card may be huge but is treated as one display action. Conversely, input from a mouse is tiny. The mouse controller will not use DMA. The Ethernet controller will not use DMA.
Basically, any device that writes to memory directly can cause the data in memory to differ from what is stored in cache. If a page is not in memory when an address associated with it is accessed, the page must be loaded, potentially displacing another page.
Virtual memory works because of the principle of locality. Specifically, when memory is accessed, the likelihood of the next access being nearby is high. Thus, pulling a page from disk to memory due to a memory access not only retrieves the memory to be accessed, but likely also the next memory elements to be accessed.
Any of the devices listed in the table could cause potential problems if it causes virtual memory to thrash, continuously swapping in and out pages from physical memory. This would happen if the locality principle is violated by the device. Careful design and sufficient physical memory will almost always solve this problem. Not typically, although it is possible. Online chat is dominated by transactions, not the size of those transactions.
When data throughput dominates numbers of transactions, then polling could potentially be a reasonable approach. In most situations, a mixture of the two approaches is the most pragmatic approach. Specifically, use commands to handle interactions and memory to exchange data. Large, concurrent data reads and writes. Large numbers of small, concurrent transactions. Ranking systems with benchmarks is generally not useful. However, understanding trade-offs certainly is.
Although benchmarks help simulate the environment of a system, nothing replaces live data in a live system. CPUs are particularly difficult to evaluate outside of the system where they are used.
No, unless computations force the system to access the disk frequently. For the RAID 1 system with redundancy to fail, both disks must fail. The probability of both disks failing is the product of the individual disk failure probabilities. The result is a substantially increased MTBF. In all applications, decreasing the likelihood of data loss is good.
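A minimal sketch of the simple both-disks-fail model used above, with a made-up single-drive failure probability:

  #include <stdio.h>

  int main(void) {
      /* Hypothetical annualized failure probability of one drive. */
      double p_single = 0.03;

      /* Ignoring repair, a mirrored (RAID 1) pair loses data only if
         both drives fail, so the probability is the product.          */
      double p_pair = p_single * p_single;

      printf("single drive: %.4f per year, RAID 1 pair: %.6f per year\n",
             p_single, p_pair);
      return 0;
  }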
However, online database and video services are particularly sensitive to resource availability. When such systems are offline, revenue loss is immediate and customers lose confidence in the service.
The trade-off is storage cost. This must be viewed both in terms of the cost of disks and in terms of the power and other resources required to keep the disk array running. In the previous applications, large online services like database and video services would definitely benefit from RAID 3. Video and sound editing may also benefit from RAID 3, but these applications are not as sensitive to availability issues as online services. a. DEE8; b. F. Specifically, RAID 3 accesses every disk for every data write, no matter which disk is being written to.
For smaller writes, where the data is located on a single disk, RAID 4 will be more efficient. Distributing the parity across disks (RAID 5) eliminates the parity disk as a bottleneck during disk access. For applications with high numbers of concurrent reads and writes, RAID 5 will be more efficient. In contrast, RAID 4 and 5 continue to access only the existing values of the data being stored. For two disks, there is no difference. [Table: IOPS, Bottleneck?] By benchmarking in a full system, or executing an actual application, an engineer can see actual numbers that are far more accurate than approximate calculations.
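The small-write behavior of RAID 4/5 mentioned above comes from the read-modify-write parity update, sketched below with 64-bit values standing in for real sectors (the data is made up):

  #include <stdint.h>
  #include <stdio.h>

  /* For a small write, RAID 4/5 touch only the target data disk and the
     parity disk: new parity = old parity XOR old data XOR new data.
     RAID 3, by contrast, touches every disk on every write.            */
  static uint64_t update_parity(uint64_t old_parity,
                                uint64_t old_data,
                                uint64_t new_data) {
      return old_parity ^ old_data ^ new_data;
  }

  int main(void) {
      uint64_t d0 = 0x1111, d1 = 0x2222, d2 = 0x3333;
      uint64_t parity = d0 ^ d1 ^ d2;

      uint64_t new_d1 = 0xBEEF;
      parity = update_parity(parity, d1, new_d1);
      d1 = new_d1;

      /* The new parity still equals the XOR of all current data blocks. */
      printf("consistent = %d\n", parity == (d0 ^ d1 ^ d2));
      return 0;
  }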
All three applications perform some kind of transaction processing, but the character of those transactions differs. A web server processes numerous transactions typically involving small amounts of data. Thus, transaction throughput is critical.
A database server is similar, but the data transferred may be much larger. A bioinformatics data server will deal with huge data sets where transactions processed is not nearly as critical as data throughput. When identifying the runtime characteristics of the application, you are implicitly identifying characteristics for evaluation.
For a web server, transactions per second is a critical metric. For the bioinformatics data server, data throughput is critical. For a database server, you will want to balance both criteria. You may also find advertisements in periodicals from your professional societies or trade journals. You should be able to identify one or more candidates using the criteria identified above.
You can use the same data and characteristics here. Remember that the Sun Fire x has multiple configurations. You should consider this when you perform your evaluation. Find similar measurements for the server that you have selected. Most of this data should be available online. If not, contact the company providing the server and see if such data is available. If you design your spreadsheet carefully, you can simply enter a table of data and make comparisons quickly.
This is exactly what you will do in industry when evaluating systems. There are a number of test suites available that will serve your needs here. Virtually all of them will be available online. Look for benchmarks that generate transactions for the web server, those that generate large data transfers for the bioinformatics server, and a combination of the two for the database server.
The ratio of the number of drives replaced in the first scenario to the number replaced in the second should give us the multiple that we want: [table: 7 Years vs. 10 Years].
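A minimal sketch of the expected-replacement arithmetic behind that ratio; the array size and MTTF are made-up placeholders, not the exercise's figures:

  #include <stdio.h>

  int main(void) {
      /* Hypothetical array size and per-drive MTTF. */
      double drives     = 1000.0;
      double mttf_hours = 1000000.0;
      double hours_7y   = 7.0 * 365.0 * 24.0;
      double hours_10y  = 10.0 * 365.0 * 24.0;

      /* Expected replacements = (drives * service hours) / MTTF. */
      double rep7  = drives * hours_7y  / mttf_hours;
      double rep10 = drives * hours_10y / mttf_hours;

      printf("7 years: %.1f drives, 10 years: %.1f drives, ratio = %.2f\n",
             rep7, rep10, rep10 / rep7);
      return 0;
  }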
The objective of the customer is not known. Thus, improving any performance metric by nearly doubling the cost may or may not have a price impact on the company. Most HTTP traffic is small, so the network is not as great a bottleneck as it would be for large data transfers. RAID 0 may be an effective solution. However, RAID 1 will almost certainly not be an effective solution. Increased availability makes our product more attractive, but the availability gain here is small.
The cost of this gain is nearly a doubling of the disk cost. As an online backup provider, availability is critical. However, online backup is more appealing when services are provided quickly, making RAID 0 appealing. Will increasing throughput in the disk array for long data reads and writes result in performance improvements for the system?
The network will be our throughput bottleneck, not disk access. RAID 0 will not help much. RAID 1 has more potential for increased revenue by making the disk array available more of the time. For our original configuration, we are losing between 12 and 19 disks over every 7-year period.
If the system lifetime is 7 years, the RAID 1 upgrade will almost certainly not pay for itself even though it addresses the most critical property of our system.
Over 10 years, we lose between 30 and 50 drives. If repair times are small, then even over a 10-year span the RAID 1 solution will not be cost effective. Simulations tend to run for days or months. Thus, losing simulation data or having a system failure during a simulation is a catastrophic event. Availability is therefore a critical evaluation parameter.
Additionally, the disk array will be accessed by parallel processors. Throughput will be a major concern. The primary role of the power constraint in this problem is to prevent simply maximizing all parameters in the disk array.
Adding additional disks and controllers without justification will increase power consumption unnecessarily. Thus, you will need multiple copies of your data and may be required to move those copies offsite. This makes none of the solutions optimal. RAID or a second backup array provides high-speed backup, but does not provide archival capabilities. Magnetic tape allows archiving, but can be exceptionally slow when compared to disk backups.
Online backup automatically achieves archiving, but can be even slower than disks. Most other parameters that govern selection of a system are relatively well understood, portability and cost being the primary issues to be evaluated.
The purpose is to get students to think about parallelism present in their daily lives. The answer should have at least 10 activities identified.
The answer should consider the amount of overlap provided through parallelism and should be less than or equal to the original time computed if each activity were carried out serially (equal only if no parallelism was possible). Solution 7: Part A asks us to compute the speedup factor, but increasing X beyond 2 or 3 should provide no benefit.
While we can perform the comparison of low and high on one core, the computation of mid on a second core, and the comparison for A[mid] on a third core, without some restructuring or speculative execution we will not obtain any speedup. The answer should include a graph showing that no speedup is obtained beyond 1, 2, or 3 cores (this value depends somewhat on the assumption made for Y). Again, given the current code, we really cannot obtain any benefit from these extra cores.
But if we create threads to compare the N elements to the value X and perform these comparisons in parallel, then we can get ideal speedup (Y times speedup), and the comparison can be completed in the amount of time needed to perform a single comparison. This problem illustrates that some computations can be done in parallel if the serial code is restructured.
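A minimal OpenMP sketch of that restructured search follows; the array contents, size, and search value are made up for illustration, and the pragma simply marks the loop whose iterations are independent.

  #include <stdio.h>

  #define N 1024

  int main(void) {
      static int a[N];
      int x = 42, matches = 0;
      for (int i = 0; i < N; i++)
          a[i] = 3 * i;                  /* illustrative data */

      /* All N comparisons against x are independent, so they can run in
         parallel; with enough cores the search takes roughly the time
         of a single comparison.                                        */
      #pragma omp parallel for reduction(+:matches)
      for (int i = 0; i < N; i++)
          if (a[i] == x)
              matches++;

      printf("matches: %d\n", matches);
      return 0;
  }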
But more importantly, we may want to provide for SIMD operations in our ISA, and allow for data-level parallelism when performing the same operation on multiple data items. The first instruction is executed once, and the loop body is executed for the remaining iterations. The cycle counts are roughly: Version 1, about 17,000 cycles; Version 2, about 22,000 cycles; Version 3, about 20,000 cycles. These will be f3 in the current iteration and f1 in the next iteration. The preferred solution will try to utilize the two nodes by unrolling the loop 4 times (this already gives you a substantial speedup by eliminating many loop increment, branch, and load instructions).
The loop body running on node 1 would look something like this (the code is not the most efficient code sequence):

  DADDIU  r2, r0, ...
          L.D     f1, -16(r1)
          L.D     f2, -8(r1)
  loop:   ADD.D   f3, f2, f1
          ADD.D   f4, f3, f2
          Send    2, f3
          Send    2, f4
          S.D     f4, 8(r1)
          Receive f5
          ADD.D   f6, f5, f4
          ADD.D   f1, f6, f5
          Send    2, f6
          Send    2, f1
          S.D     f1, 32(r1)
          Receive f2
          S.D     f4, f3, f2
          ADD.D   f6, f5, f4
          S.D     ...

This loop takes far fewer cycles, which is much better than close to 18K. But the unrolled loop would run faster given the current send instruction latency.
This illustrates why using distributed message passing is difficult when loops contain loop-carried dependences. In part A, when forming the lists, we spawn a thread for the computation of the left half in the MergeSort code, and spawn a thread for the computation of the right half.
But if we had m cores, we could perform sort- ing using a very different algorithm. So this is one possible answer for the question. It is known as parallel comparison sort. Various comparison sort algorithms include odd-even sort and cocktail sort.
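As one concrete illustration of a comparison sort that parallelizes naturally, here is a minimal C sketch of odd-even (transposition) sort; the input data is made up, and the OpenMP pragma only marks the disjoint pairs within a phase that could be compared concurrently.

  #include <stdio.h>

  #define N 8

  /* Odd-even (transposition) sort: each phase compares and swaps
     disjoint neighbouring pairs, so every comparison within a phase is
     independent and could run on its own core; n phases sort n items.  */
  static void odd_even_sort(int a[], int n) {
      for (int phase = 0; phase < n; phase++) {
          #pragma omp parallel for
          for (int i = phase % 2; i < n - 1; i += 2) {
              if (a[i] > a[i + 1]) {
                  int t = a[i];
                  a[i] = a[i + 1];
                  a[i + 1] = t;
              }
          }
      }
  }

  int main(void) {
      int a[N] = {7, 3, 8, 1, 6, 2, 5, 4};
      odd_even_sort(a, N);
      for (int i = 0; i < N; i++)
          printf("%d ", a[i]);
      printf("\n");
      return 0;
  }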
We assume that we do not have to reheat the oven for each cake. Mix ingredients for Cake 2 in bowl. Finish baking Cake 1. Empty cake pan. Fill cake pan with bowl contents for Cake 2 and bake Cake 2. Mix ingredients in bowl for Cake 3.
Finish baking Cake 2. Fill cake pan with bowl contents for Cake 3 and bake Cake 3. Finish baking Cake 3. We will name them A, B and C. Mix ingredients for Cake 2 in bowl A. Empty cake pan A. Fill cake pan A with contents of bowl A for Cake 2.
Mix ingredients in bowl A for Cake 3. Fill cake pan A with contents of bowl A for Cake 3. The point here is that we cannot carry out any of these items in parallel because we either have one person doing the work, or we have limited capacity in our oven. The time to bake 1 cake, 2 cakes, or 3 cakes is exactly the same. Given that we have multiple processors (or ovens and cooks), we can execute instructions (or cook multiple cakes) in parallel.
The instructions in the loop (or the cooking steps) may have some dependences on prior instructions (or cooking steps) in the loop body (cooking a single cake). Data-level parallelism occurs when loop iterations are independent, i.e., when there are no loop-carried dependences.
Task-level parallelism includes any instructions that can be computed on parallel execution units; these are similar to the independent operations involved in making multiple cakes. The multiplications and additions associated with a single element of C are dependent: we cannot start summing up the results of the multiplications for an element until the two products are available.
So in this question, the speedup should be very close to 4. Each update would incur the cost of a cache miss, and so will reduce the speedup obtained by a factor of 3 times the cost of servicing a cache miss.
The easiest way to solve the false sharing problem is to compute the elements of C by traversing the matrix across columns instead of rows. The elements concurrently written by different cores are then mapped to different cache lines, and this eliminates false sharing.
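The underlying principle is that each core's concurrent writes to C should land in different cache lines. The OpenMP sketch below uses a row-major layout and made-up sizes (not the exercise's matrices) and is one way to realize this: each thread owns whole rows and traverses across the columns of those rows, so its writes stay at least a full row away from the writes of other threads. Whether a row-wise or column-wise partition is the one that avoids the sharing depends on how C is laid out and divided among the cores.

  #include <stdio.h>

  #define N 16   /* made-up size */

  static double A[N][N], B[N][N], C[N][N];

  int main(void) {
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) {
              A[i][j] = i + j;
              B[i][j] = i - j;
          }

      /* Work is partitioned by row index i.  With C stored row-major,
         rows written by different threads are N*sizeof(double) bytes
         apart, so concurrent writes fall in different cache lines.     */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) {
              double sum = 0.0;
              for (int k = 0; k < N; k++)
                  sum += A[i][k] * B[k][j];
              C[i][j] = sum;
          }

      printf("C[0][0] = %g\n", C[0][0]);
      return 0;
  }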
First, the write will generate a read of the L2 cache line from memory, and then the line is written to the L1 cache. The data updated in the block is updated in L1 and L2 (assuming L1 is updated on a write miss). Specific to the coherency protocol assumed, on the first read from another node a cache-to-cache transfer of the entire dirty cache line takes place.
The other two reads can be serviced from any of the caches on the two nodes with the updated data. The accesses for the other three writes are handled exactly the same way. The key concept here is that all nodes are interrogated on all reads to maintain coherency, and all must respond to service the read miss.
The directory controller will then initiate the cache-to-cache transfer, but will not need to bother the L2 caches on the nodes where the line is not present. All state updates are handled locally at the directory. For the last two reads, again the single directory is interrogated and the directory controller initiates the cache-to-cache transfer. But only the two nodes participating in the transfer are involved.