SPARC64-III User's Guide Errata (V2.0)                         Feb. 16 1999

*** This file is located at /am/pub/620-00205-A/s_64_errata.txt (ASCII file)
    in HAL internal network.


### Addition in 5.1.7.2 "ftt = unfinished_FPop" (page 78)

The following are the unfinished_FPop trapping cases for all floating-point
instructions. Instructions that are not listed in the tables below never
generate the fp_exception_other trap with ftt=unfinished_FPop.
The cases fall into two categories: denormal operand trapping cases and
denormal result trapping cases.

Denormal operand trapping cases:

+-----------+-----------------------------------------------------------------+
|           |                   Number of denormal operands                   |
|Instruction+-------------------------------------------+---------------------+
|           |                     1                     |          2          |
+===========+===========================================+=====================+
|FSQRT(sd)  |unfinished_FPop when operand is positive   |N/A                  |
+-----------+-------------------------------------------+---------------------+
|FDIV(sd)   |unfinished_FPop when the other operand is  |unfinished_FPop      |
|           |not special value (NaN, Infinity, Zero) and|                     |
|           |there is no *overflow*                     |                     |
+-----------+-------------------------------------------+---------------------+

- FDIV(sd) overflow cases:

  Term definitions
    Exp1   = exponent of src1 (0 for denormal)
    Exp2   = exponent of src2 (1 for denormal)
    Frc1   = fraction of src1 (0 for denormal)
    Frc2   = fraction of src2 (1 for denormal)
    Bias   = 0x7f(single), 0x3ff(double)
    MaxExp = 0xff(single), 0x7ff(double)

  FDIV(sd) overflow detection is based on the difference between the
  exponent values of src1 and src2 (i.e. Exp1 - Exp2 + Bias).
  The complete formulas are as follows.

  [Src1=denormal, Src2=denormal]
    0x1 < 0x0 - 0x1 + Bias < MaxExp    ==> unfinished_FPop

  [Src1=denormal, Src2=normal]
    0x1 < 0x0 - Exp2 + Bias < MaxExp   ==> unfinished_FPop
    0x0 - Exp2 + Bias = 0x1            ==> unfinished_FPop
    0x0 - Exp2 + Bias < 0x1            ==> when FSR.UFM = 0
                                             unfinished_FPop
                                           when FSR.UFM = 1
                                             IEEE_754_exception(underflow)

  [Src1=normal, Src2=denormal]
    MaxExp < Exp1 - 0x1 + Bias         ==> IEEE_754_exception(overflow)
    MaxExp = Exp1 - 0x1 + Bias         ==> IEEE_754_exception(overflow)
    0x0 < Exp1 - 0x1 + Bias < MaxExp   ==> unfinished_FPop

Denormal result trapping cases:

+-----------+----------------------------------------------+
|Instruction|                    Result                    |
+===========+==============================================+
|FDIV(sd)   |unfinished_FPop when *underflow* and FSR.UFM=0|
+-----------+----------------------------------------------+

- FDIV(sd) underflow cases:

  FDIV(sd) underflow detection is based on the difference between the
  exponent values of src1 and src2 (i.e. Exp1 - Exp2 + Bias).
  The complete formulas are as follows.

  [Src1=normal, Src2=normal]
    MaxExp < Exp1 - Exp2 + Bias        ==> IEEE_754_exception(overflow)
    MaxExp = Exp1 - Exp2 + Bias        ==> when Frc1 >= Frc2
                                             IEEE_754_exception(overflow)
    0x1 < Exp1 - Exp2 + Bias < MaxExp  ==> no exception
    Exp1 - Exp2 + Bias = 0x1           ==> when Frc1 >= Frc2
                                             no exception
                                           when Frc1 < Frc2 and FSR.UFM = 0
                                             unfinished_FPop
                                           when Frc1 < Frc2 and FSR.UFM = 1
                                             IEEE_754_exception(underflow)
    Exp1 - Exp2 + Bias < 0x1           ==> when FSR.UFM = 0
                                             unfinished_FPop
                                           when FSR.UFM = 1
                                             IEEE_754_exception(underflow)


### Addition in 5.2.3 Processor Interrupt Level (PIL) Register (page 86)

Programming note:
  SCHED_INT might not be detected precisely when a write to the PIL is
  pending, as follows.

  Assume SCHED_INT (ASR22) bit n is set, where x < n <= y:

    (1) wr %pil     ! set value x into %pil
    (2) wr %pil     ! set value y into %pil

  If there is no instruction between (1) and (2), the CPU might not
  recognize the scheduled interrupt (level=n) at all.
  A syncing instruction such as TN is required between (1) and (2) to
  guarantee that the scheduled interrupt (level=n) happens before (2).
  If the src1 and src2 fields of instruction (1) specify
  "%g0 + immediate", placing at least 4 instructions between (1) and (2)
  also works.


### Addition in 5.2.11.1 Hardware Mode Register (ASR18) (page 94)

Programming note:
  When the DPE (Data Prefetch Enable) bit is set, the hardware works as if
  the D1 cache line size were doubled (128 Bytes). Currently the OS cannot
  bind specific applications to the CPUs whose DPE bit is set so as to give
  the benefit of the data prefetch feature only to those selected
  applications. The only way to make use of this function is for the OS to
  set the DPE bit of all CPUs in the system.


### Correction in 5.2.11.12 State Control Register (ASR31) Table 19 (page 104)

The following sentence in the description of SEQUENTIAL_MODE (Bit 0, SM) in
Table 19 "State Control Register (ASR31) Field Definitions" on page 104
should be changed as below.

(Old)
  The SM bit also disables speculative instruction prefetching.
  Instruction accesses occur only when an instruction can be issued.
  Note that block prefetch around the instruction is still possible.

(New)
  The SM bit doesn't affect speculative instruction prefetching.


### Addition in 5.2.11.12 State Control Register (ASR31) (page 105)

When the CPU takes an async_error trap (TT=0x063), bit 51
(DISABLE_ASYNC_ERROR_TRAP) is set by the hardware so that the same trap
doesn't occur in the trap handler. It is the software's responsibility to
reset the bit after the asynchronous error trap handling.

The Level-1 Cache disable-way bits (bits 35:32 for I1-Cache and bits 39:36
for D1-Cache) are provided basically for processor debugging purposes.
Therefore it is not recommended to set any of these bits to "1". In
particular, the I1-Cache cannot be disabled by setting all of bits 35:32 to
"1". The same applies to the D1-Cache.
Disabling all four ways of either Level-1 Cache causes the processor to hang.


### Addition in 6.1.2 Instruction Prefetch (page 108)

One of the prefetch actions of the CPU is that when an I0 cache miss occurs
at line N of the cache (the line size is 64 Bytes), the next cache line
(N+1) is prefetched into the I0 cache.


### Addition in 6.1.3.1 "Other Serializing Instructions" (page 111)

The following events also serialize the CPU:

  - All traps EXCEPT
        illegal_instruction
        clean_window
        programmed_emulation trap
        spill_n_normal(n=0..7)
        spill_n_other(n=0..7)
        fill_n_normal(n=0..7)
        fill_n_other(n=0..7)
        trap_instruction (only when "ta %g0+simm13")


### Correction in 7.5.2 "Trap Type" Table 30 (page 150)

In Table 30 "Exception and Interrupt Requests, by Priority", "asnc_error"
(6th line from the top) should read "async_error".


### Addition in 8.1.1 "SPARC64-III Hardware Memory Models" (page 172)

The following table shows the load/store ordering, including the cacheable
and noncacheable combinations, in each memory model of SPARC64-III. The
load/store order means the completion order of loads and stores that is
visible to all other processors. It is defined as the order in which the
issuing processor finishes the transactions from/to the UPA bus.

+----------------+--------------+-----+-----+-----+
| Program order  | Cacheability |HLSO |HTSO |HSTO |
+----------------+--------------+-----+-----+-----+
|                |   CH -> CH   |  O  |  O  |  X  |
|                +--------------+-----+-----+-----+
|                |   CH -> NC   |  O  |  O  |  X  |
| Load -> Load   +--------------+-----+-----+-----+
|                |   NC -> CH   |  O  |  O  |  X  |
|                +--------------+-----+-----+-----+
|                |   NC -> NC   |  O  |  O  |  X  |
+----------------+--------------+-----+-----+-----+
|                |   CH -> CH   |  O  |  O  |  O  |
|                +--------------+-----+-----+-----+
|                |   CH -> NC   |  O  |  O  |  O  |
| Load -> Store  +--------------+-----+-----+-----+
|                |   NC -> CH   |  O  |  O  |  O  |
|                +--------------+-----+-----+-----+
|                |   NC -> NC   |  O  |  O  |  O  |
+----------------+--------------+-----+-----+-----+
|                |   CH -> CH   |  O  |  X  |  X  |
|                +--------------+-----+-----+-----+
|                |   CH -> NC   |  O  |  X  |  X  |
| Store -> Load  +--------------+-----+-----+-----+
|                |   NC -> CH   |  X  |  X  |  X  |
|                +--------------+-----+-----+-----+
|                |   NC -> NC   |  O  |  X  |  X  |
+----------------+--------------+-----+-----+-----+
|                |   CH -> CH   |  O  |  O  |  O  |
|                +--------------+-----+-----+-----+
|                |   CH -> NC   |  O  |  O  |  O  |
| Store -> Store +--------------+-----+-----+-----+
|                |   NC -> CH   |  O  |  O  |  X  |
|                +--------------+-----+-----+-----+
|                |   NC -> NC   |  O  |  O  |  O  |
+----------------+--------------+-----+-----+-----+

- "CH" means cacheable access, and "NC" means noncacheable access.
- "O" means in order and "X" means out of order.
- CAS(X)A and FLUSH instructions cause the processor and the memory to sync
  anyway. Regardless of the memory model, any kind of load/store order
  before and after these instructions is maintained.
- The effects of the PREFETCH(A) instruction are not software-visible except
  in the data_access_error case. Therefore memory prefetch ordering by the
  PREFETCH(A) instruction is not guaranteed in any memory model.
- The MEMBAR instruction itself is always kept in order with respect to all
  kinds of memory instructions before and after it, in all memory models.
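
Illustration (not part of the User's Guide): the 8.1.1 ordering table can be
transcribed into a small lookup, which may help when checking whether a given
access pair needs explicit synchronization. This is only a sketch. The names
MODELS, CACHEABILITY, ORDERING, in_order and out_of_order_cases are
hypothetical, and every table entry is taken to be either "O" or "X" as the
legend defines.

```python
# Hypothetical encoding of the 8.1.1 load/store ordering table; the names
# used here are illustrative and do not come from the User's Guide.
MODELS = ("HLSO", "HTSO", "HSTO")
CACHEABILITY = ("CH->CH", "CH->NC", "NC->CH", "NC->NC")

# One string per cacheability combination (same order as CACHEABILITY),
# one character per memory model (same order as MODELS):
# "O" = in order, "X" = out of order.
ORDERING = {
    "Load->Load":   ("OOX", "OOX", "OOX", "OOX"),
    "Load->Store":  ("OOO", "OOO", "OOO", "OOO"),
    "Store->Load":  ("OXX", "OXX", "XXX", "OXX"),
    "Store->Store": ("OOO", "OOO", "OOX", "OOO"),
}


def in_order(program_order, cacheability, model):
    """Return True if the table marks this combination as in order."""
    row = ORDERING[program_order][CACHEABILITY.index(cacheability)]
    return row[MODELS.index(model)] == "O"


def out_of_order_cases(model):
    """List the (program order, cacheability) pairs marked "X" for a model."""
    return [
        (order, cache)
        for order in ORDERING
        for cache in CACHEABILITY
        if not in_order(order, cache, model)
    ]


if __name__ == "__main__":
    for model in MODELS:
        print(model, out_of_order_cases(model))
```

Under this transcription, HLSO has a single out-of-order case (Store -> Load,
NC -> CH), which matches the one "X" in the HLSO column of the table.
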


### Addition in 8.4.3.1 "Ordering MEMBAR Instructions" (page 180)

The following table shows the memory order which can be imposed by ordering
MEMBAR instructions.

+----------------+--------------+-----+-----+-----+
| MEMBAR variant | Cacheability |HLSO |HTSO |HSTO |
+----------------+--------------+-----+-----+-----+
|                |   CH -> CH   |  O  |  O  |  @  |
|                +--------------+-----+-----+-----+
|                |   CH -> NC   |  O  |  O  |  @  |
| #LoadLoad      +--------------+-----+-----+-----+
|                |   NC -> CH   |  O  |  O  |  @  |
|                +--------------+-----+-----+-----+
|                |   NC -> NC   |  O  |  O  |  @  |
+----------------+--------------+-----+-----+-----+
|                |   CH -> CH   |  O  |  O  |  O  |
|                +--------------+-----+-----+-----+
|                |   CH -> NC   |  O  |  O  |  O  |
| #LoadStore     +--------------+-----+-----+-----+
|                |   NC -> CH   |  O  |  O  |  O  |
|                +--------------+-----+-----+-----+
|                |   NC -> NC   |  O  |  O  |  O  |
+----------------+--------------+-----+-----+-----+
|                |   CH -> CH   |  O  |  @  |  @  |
|                +--------------+-----+-----+-----+
|                |   CH -> NC   |  O  |  @  |  @  |
| #StoreLoad     +--------------+-----+-----+-----+
|                |   NC -> CH   |  X  |  X  |  X  |
|                +--------------+-----+-----+-----+
|                |   NC -> NC   |  O  |  @  |  @  |
+----------------+--------------+-----+-----+-----+
|                |   CH -> CH   |  O  |  O  |  O  |
|                +--------------+-----+-----+-----+
|                |   CH -> NC   |  O  |  O  |  O  |
| #StoreStore    +--------------+-----+-----+-----+
|                |   NC -> CH   |  O  |  O  |  X  |
|                +--------------+-----+-----+-----+
|                |   NC -> NC   |  O  |  O  |  O  |
+----------------+--------------+-----+-----+-----+

- "O" means the memory model itself guarantees the order. Therefore no MEMBAR
  is needed to ensure the ordering. In particular, MEMBAR#LoadStore is not
  needed under any memory model. The hardware treats it as a NOP.
- "@" means the MEMBAR guarantees the order.
- "X" means the order is not guaranteed even with the corresponding ordering
  MEMBAR. To ensure those orders, use MEMBAR#Sync.


### Correction in 8.4.6 "Hardware Primitives for Mutual Exclusion" (page 183)

The first paragraph on page 183 should be changed as below.

  When the hardware mutual-exclusion primitives address I/O locations, the
  attempts to access the locations result in a data_access_exception trap
  with an illegal access to a noncacheable page. See 7.7, "Exception and
  Interrupt Descriptions: data_access_exception", page 163, for details.
  In addition, with proper system design, the atomicity of the hardware
  mutual-exclusion primitives can be guaranteed not only for processor
  memory references but also when the memory location is simultaneously
  being addressed by an I/O device such as a channel or DMA.

The third paragraph should be deleted entirely.


### Addition in 8.4.7 "Synchronizing Instruction and Data Memory" (page 184)

The following discusses the SPARC64-III implementation of synchronization of
instruction and data memory.

SPARC64-III cache coherency between the instruction cache and the data cache
is as follows. (See Appendix M "Cache Organization" for the SPARC64-III cache
organization.)

- When I1-Cache has the line, D1-Cache cannot have the line in modified state.
- When D1-Cache has the line in modified state, I1-Cache cannot have the line.
- I1-Cache is totally inclusive of I0-Cache.
    When I0-Cache has a line, I1-Cache also has the line.
    When I1-Cache doesn't have a line, I0-Cache also doesn't have the line.

The following are the essential examples of the cache coherency actions.

Example-1:
  Initial: I1-Cache(invalid)
           D1-Cache(modified)
  Access:  Instruction Fetch
  Action:  D1-Cache data is copied back to U2-Cache.
           D1-Cache state is changed to clean.
           I1-Cache gets data from U2-Cache (most up-to-date) in clean state.

Example-2:
  Initial: I1-Cache(clean)
           D1-Cache(invalid)
  Access:  Load
  Action:  D1-Cache gets data from U2-Cache in clean state.
           (No action to I1-Cache)

Example-3:
  Initial: I1-Cache(clean)
           D1-Cache(invalid)
  Access:  Store
  Action:  I1-Cache data is invalidated.
           D1-Cache gets data from U2-Cache in modified state.
           Store is performed into D1-Cache.

Example-4:
  Initial: I1-Cache(clean)
           D1-Cache(clean)
  Access:  Store
  Action:  I1-Cache data is invalidated.
           D1-Cache gets store permission and state is changed to modified.
           Store is performed into D1-Cache.

With this implementation, instruction coherency is maintained by each store
itself without the intervention of a FLUSH instruction, except for the
instructions which are already fetched into the I-buffer (up to 12
instructions) and the CPU pipeline (up to 64 instructions) when the store is
committed. Therefore, after instructions are modified by stores, if one of
the following conditions holds, it is guaranteed that the CPU executes
instructions which have the effects of all prior stores.

- The instruction is more than 76 instructions away from the last store.
- After a FLUSH instruction
  (Just one FLUSH instruction is enough no matter how many memory address
  locations are modified by prior stores.)
- After a MEMBAR#Sync instruction
- After a pair of a syncing instruction and a done/retry instruction.
  (It is OK to have any number of instructions between the syncing
  instruction and the done/retry instruction.)

Any of the above cases flushes the instructions which are in both the
I-buffer and the CPU pipeline.

Because the execution of a FLUSH instruction is time-consuming, the following
is recommended for software to maximize performance.

- Don't use the FLUSH instruction. Use one MEMBAR#Sync instruction only when
  it is necessary to ensure the local I-Cache coherency for any size of area.

Regarding the propagation of the effect of the FLUSH instruction to all of
the processors in the system, it is always guaranteed even if MEMBAR#Sync is
used in place of FLUSH, for the following reasons.

- When one processor tries to do a store to some memory location, the copies
  in the other processors' caches are invalidated before the store completes.
- In a processor, the local icache coherency is maintained at least 76
  instructions after a store which modifies instruction data.
  This is enough to satisfy the requirement of "eventual coherency across
  all caches in a multiprocessor system after FLUSH instruction execution in
  a processor in the system."


### Addition in A.25 "Load Floating-point" (page 260)

Implementation Note:
  Unlike the SPARC-V9 specification, in SPARC64-III, LDDF causes an
  LDDF_mem_address_not_aligned exception if the effective memory address is
  not word-aligned. (SPARC-V9 specifies that LDDF should cause a
  mem_address_not_aligned exception in this case.)


### Correction in A.42.3 "SPARC64-III PREFETCH Behavior" (page 300)

The first sentence on page 300 should be changed as below.

  3. If translation succeeds and if the referenced location is non-cacheable
     or already in the D1 Cache, the D1 cache will complete the PREFETCH.
     That means nothing is done by the hardware in terms of data prefetch.


### Addition in A.52 "Store Floating-point" (page 316)

Implementation Note:
  Unlike the SPARC-V9 specification, in SPARC64-III, STDF causes an
  STDF_mem_address_not_aligned exception if the effective memory address is
  not word-aligned. (SPARC-V9 specifies that STDF should cause a
  mem_address_not_aligned exception in this case.)


### Correction in O.3 "Processor State after Reset and in RED_state" Table 91 (page 437)

In Table 91 "CPU State after Reset and in RED_state" on page 437, the POR
value for D_UPA of SCR (ASR31) should be changed from "1" to "0".