Technologic Systems designs its embedded ARM-based computers for reliability. When the first Technologic Systems products with NAND flash were designed, SLC NAND flash was selected over MLC NAND flash to ensure the highest reliability. Initial product testing showed that SLC NAND flash could endure well over the published 100,000 write/erase cycles, especially when used with the NAND user-space device driver XNAND. However, since that time the flash industry has quietly shrunk the die used in SLC NAND, and the endurance has dropped significantly. To address the decreasing endurance of SLC NAND, Technologic Systems is rolling out a significant upgrade to the XNAND user-space device driver: XNAND2.
XNAND2 has been designed to be best in breed for embedded NAND flash management. It provides the same NAND flash management features found in XNAND, including ultra-reliable system bootup, guaranteed data integrity, no reduction in file system size over time, and atomic write operations, to name a few. Additionally, XNAND2 introduces wear leveling, which will significantly extend the endurance of the NAND flash. XNAND2 also updates the mechanisms for error detection and recovery to ensure the overall reliability of the system. After thorough application and algorithm level testing, XNAND2 will be deployed on several Technologic Systems products in early 2017.
XNAND2 Performs Under Test
Aggressive write/erase cycling to a single logical block of NAND, while not indicative of real world application use, can be a great early indicator of the quality and endurance of NAND flash devices and of the software drivers that manage them. Under aggressive stress testing, the original SLC NAND devices designed into Technologic Systems products could endure well over 100,000 write/erase cycles, especially when combined with XNAND technology. By contrast, the same stress tests wear through a block of the lower endurance NAND flash devices being manufactured today in fewer than 40,000 write/erase cycles. After identifying this change in SLC NAND flash endurance, Technologic Systems began designing and testing XNAND2 as a drop-in replacement for XNAND. XNAND2 endures the same aggressive stress testing without failure, and it performed well under corner-case testing meant to stress all aspects of the algorithm.
In order to verify XNAND2's real world performance, Technologic Systems set up a test environment meant to mimic real world use cases in a controlled environment. All of the tests were run on Technologic Systems' TS-4200 systems which were populated with one of three different SLC NAND devices:
- an SLC NAND device manufactured before endurance of the SLC NAND began to fall off
- a newer lower endurance device from Samsung
- a newer lower endurance device from Toshiba with a longer guaranteed product availability
All three of these NAND devices have been or are currently used on various Technologic Systems products. XNAND2 running on the low endurance NAND devices was compared against the original XNAND running on both low endurance and high endurance SLC NAND, to show that XNAND2 restores the low endurance NAND to the level of endurance achieved by the original XNAND design.
To mimic a real world environment, the file system on the TS-4200 was configured with a section of static data and a section for dynamic write/erase activity. Most systems contain some static data in the NAND flash storage, including OS files and user files that are changed infrequently, or perhaps not at all, during the life of the product. Systems also include dynamically changing files, such as those produced by data logging activity, and other system structures, such as file system tables, that need to be updated regularly. It is these dynamic files that typically contribute to flash wear out and failure by creating a hotspot of write/erase activity on a NAND flash device. XNAND2 was developed to prevent these hotspots from developing. The static data section filled 240 MBytes of the addressable storage space, and the data there was checked periodically to ensure that no data corruption occurred. In the remaining storage space, 6.75 MBytes of variable data was written, verified, and erased repeatedly to create a hotspot of dynamic file activity.
After running for more than two months, the units with XNAND2 are still going without any failures. In that same time period, all of the units running XNAND on the low endurance NAND devices had failed. The systems running XNAND on the older high endurance SLC NAND are also still going without failure, showing that XNAND2 is as reliable as the original solution. The table below shows the average number of writes to the dynamic region survived by the devices. Every device that failed did so because of permanent data loss due to NAND flash wear out.
| Driver | High Endurance SLC NAND | Low Endurance SLC NAND: Toshiba (longest promised availability) | Low Endurance SLC NAND: Samsung |
|---|---|---|---|
| XNAND | Still Running | Failed after 130 million writes | Failed after 1.3 billion writes |
| XNAND2 | N/A | Still Running | Still Running |
What is XNAND2
XNAND2 performs well under testing due to Technologic Systems' design improvements to the flash management mechanisms. Like XNAND, XNAND2 is a user-space device driver which manages the NAND flash device for the system. XNAND2 is a drop-in replacement for Technologic Systems' XNAND nandctl (for more details on XNAND, refer to our white paper Industrial Grade Flash Reliability with RAID-like XNAND Driver). The change from XNAND to XNAND2 is transparent from the application's perspective, and provides the needed improvements to compensate for the new lower endurance SLC NAND devices.
Like XNAND, XNAND2 presents 256 MBytes of addressable storage to the system, and ensures that the data stored there will always be retrieved accurately and without error. Behind the scenes, XNAND2 implements several key features for ensuring SLC NAND flash endurance on the devices available today, including wear leveling and advanced error detection. Also, by utilizing a physical NAND device that has 512 MBytes of available storage, XNAND2 has enough physical storage area for RAID5-like data redundancy while still leaving plenty of overhead for the wear leveling algorithm.
The wear leveling algorithm in XNAND2, like all flash management wear leveling solutions, extends the endurance of the NAND device by reducing the number of write/erase cycles a single block of flash storage experiences. The wear leveling algorithm maps the system's logical addressable storage to different physical storage locations and then ensures that the write/erase activity is shared out more evenly across the physical storage. If a system repeatedly writes and rewrites the same logical location in the addressable storage, it is the responsibility of the wear-leveling algorithm to ensure that write and erase activity is shared amongst all of the physical blocks of NAND storage so that no one block is worn through.
There are two major types of wear leveling algorithms: static and dynamic. The wear-leveling used by XNAND2 is a static wear leveling algorithm which provides the best overall NAND endurance possible. A dynamic wear leveling algorithm only rotates write/erase activity in areas of storage that are erased often or are currently considered empty thereby smoothing a single hotspot into a warm region on the device. The dynamic algorithm will extend the life of a NAND device, but the warm region will wear through prematurely and cause system failures. On the other hand, a static wear leveling algorithm evaluates the state of all of the blocks on the flash device including used blocks which contain data and erased blocks which are empty. The static algorithm then attempts to evenly distribute any write/erase activity across the full device, shifting data out of used blocks when necessary to keep all blocks at the same level of wear.
XNAND2 keeps the wear of all blocks on the NAND device within a few hundred write/erase cycles of each other under all conditions. Whenever the system makes a write or rewrite request, the wear leveling algorithm begins by selecting the block with the fewest write/erase cycles, that is, the least worn block available in the erased block pool. In the case of a rewrite request, the block containing the original data is marked as discarded once the modified copy of the data has been written to a new block.
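The selection rule described above can be sketched in a few lines of Python. This is an illustrative model only, not the XNAND2 source: the class `WearLeveler`, its attribute names, and the in-memory `storage` dictionary standing in for the NAND device are all hypothetical, and erasing of discarded blocks is reduced to a simple helper that runs only when the erased pool is empty.

```python
import heapq


class WearLeveler:
    """Toy logical-to-physical block mapper (hypothetical, not XNAND2 code)."""

    def __init__(self, num_physical_blocks):
        self.erase_counts = [0] * num_physical_blocks
        # Min-heap of (erase_count, block): the least worn erased block pops first.
        self.erased_pool = [(0, b) for b in range(num_physical_blocks)]
        heapq.heapify(self.erased_pool)
        self.logical_to_physical = {}
        self.discarded = []   # blocks holding stale data, waiting to be erased
        self.storage = {}     # stands in for the physical NAND contents

    def write(self, logical_block, data):
        if not self.erased_pool:
            self.collect_garbage()
        if not self.erased_pool:
            raise RuntimeError("no erased blocks available")
        # Every write, including a rewrite, goes to the least worn erased block.
        _, physical = heapq.heappop(self.erased_pool)
        self.storage[physical] = data
        old = self.logical_to_physical.get(logical_block)
        self.logical_to_physical[logical_block] = physical
        if old is not None:
            # The original copy is discarded only after the new copy is written.
            self.discarded.append(old)

    def read(self, logical_block):
        return self.storage[self.logical_to_physical[logical_block]]

    def collect_garbage(self):
        # Erase discarded blocks and return them to the erased pool.
        for physical in self.discarded:
            del self.storage[physical]
            self.erase_counts[physical] += 1
            heapq.heappush(self.erased_pool, (self.erase_counts[physical], physical))
        self.discarded.clear()


leveler = WearLeveler(num_physical_blocks=8)
for value in range(1000):
    leveler.write(0, value)   # hammer a single logical block
print(leveler.erase_counts)   # the erases are shared across all eight blocks
```

Even though the loop rewrites one logical block a thousand times, the erase counts in this model stay close together because each rewrite lands on whichever erased block has been used least.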
Garbage Collection Keeps Blocks Even
To prevent the number of discarded blocks from overwhelming the system, a garbage collection algorithm runs regularly in the background. Garbage collection erases discarded blocks of memory so that they are available for requests from the system. The garbage collection algorithm also regularly examines the used blocks to evaluate their level of wear. No good block is exempt from this examination: every physical block of the device, regardless of how it is currently being used, is included. When the blocks in the erased block pool start to have more wear than the used blocks, the garbage collector carefully copies the data from the least worn used block into the most worn erased block. The previously used block is then erased, making it available to the wear leveling algorithm. Because a 512 MByte NAND flash holds 256 MBytes of addressable storage, there are always plenty of erased blocks available.
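The data-shifting step described above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions rather than the XNAND2 implementation: the function name `rebalance`, the dictionary and set data structures, and the `threshold` parameter are hypothetical, and a real driver would also update its translation table after the move.

```python
def rebalance(erase_counts, used, erased, threshold=100):
    """One static wear-leveling pass (hypothetical sketch).

    erase_counts: dict mapping physical block -> write/erase cycles so far
    used:         dict mapping physical block -> data it currently holds
    erased:       set of physical blocks that are empty and ready for use
    threshold:    assumed wear gap that triggers a move (not from the source)

    Returns (freed_block, destination_block), or None if nothing moved.
    """
    if not used or not erased:
        return None
    most_worn_erased = max(erased, key=lambda b: erase_counts[b])
    least_worn_used = min(used, key=lambda b: erase_counts[b])
    gap = erase_counts[most_worn_erased] - erase_counts[least_worn_used]
    if gap < threshold:
        return None
    # Copy first, so the data exists in two places until the move is complete.
    used[most_worn_erased] = used[least_worn_used]
    erased.remove(most_worn_erased)
    # Only then erase the lightly worn block and hand it to the erased pool.
    del used[least_worn_used]
    erase_counts[least_worn_used] += 1
    erased.add(least_worn_used)
    return least_worn_used, most_worn_erased


counts = {0: 5, 1: 120, 2: 118}
used_blocks = {0: b"static OS image"}
erased_blocks = {1, 2}
print(rebalance(counts, used_blocks, erased_blocks))  # (0, 1)
```

The effect is that rarely changing data is parked on the most worn block, while the barely used block it occupied rejoins the pool that absorbs new writes.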
Keeping Track of Blocks with Translation Tables
The wear leveling algorithm in XNAND2 manages the mapping of the logical addressable storage to different physical storage locations using a translation table. The translation table is carefully managed to ensure that it always reflects the true locations of all of the data in the system. After any write activity, the translation table needs to be updated. Rather than wearing the NAND device by rewriting the table in the same location, a new translation table is created. To facilitate the detection of the current translation table, ECC protected markers are placed in special bits on the edges of a block of NAND. These markers indicate the presence and status of a translation table. When a new translation table is needed, the old translation table is marked with a flag to indicate that it is being retired but is still the active table. The new table is created by copying the current table with the required modifications. Once created, the new table is marked as current, and the old table is marked as discarded and erased in the next garbage collection cycle. If this process is interrupted, XNAND2 audits the system and makes any corrections required for correct operation.
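The marker sequence described above behaves like a small state machine, sketched here in Python. The details are assumptions made for illustration: the `Marker` names, the extra `INCOMPLETE` state for a table that is still being written, the `fail_after_step` parameter used to simulate a power loss, and the recovery rule in `audit` are hypothetical and are not taken from the XNAND2 source.

```python
from enum import Enum


class Marker(Enum):
    CURRENT = "current"
    RETIRING = "retiring"      # being replaced, but still the active table
    INCOMPLETE = "incomplete"  # assumed state: new table not yet finished
    DISCARDED = "discarded"    # waiting for garbage collection


class Table:
    def __init__(self, mapping, marker):
        self.mapping = dict(mapping)   # logical block -> physical block
        self.marker = marker


def find_active(tables):
    # Prefer a finished new table; otherwise fall back to the retiring one.
    for wanted in (Marker.CURRENT, Marker.RETIRING):
        for table in tables:
            if table.marker is wanted:
                return table
    raise LookupError("no usable translation table")


def update_table(tables, changes, fail_after_step=None):
    """Write a modified copy of the active table; may stop early to mimic power loss."""
    old = find_active(tables)
    old.marker = Marker.RETIRING                        # step 1
    if fail_after_step == 1:
        return
    new = Table({**old.mapping, **changes}, Marker.INCOMPLETE)
    tables.append(new)                                  # step 2
    if fail_after_step == 2:
        return
    new.marker = Marker.CURRENT                         # step 3
    if fail_after_step == 3:
        return
    old.marker = Marker.DISCARDED                       # step 4


def audit(tables):
    """Power-up check: settle on one active table and discard the leftovers."""
    active = find_active(tables)
    for table in tables:
        if table is not active and table.marker is not Marker.DISCARDED:
            table.marker = Marker.DISCARDED
    active.marker = Marker.CURRENT
    return active


tables = [Table({0: 10}, Marker.CURRENT)]
update_table(tables, {0: 11}, fail_after_step=2)   # power lost mid-copy
print(audit(tables).mapping)                       # falls back to the old table
```

Because the old table stays valid until the new one is marked current, an interruption at any step in this model leaves at least one complete table for the audit to select.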
Data Accuracy Guaranteed
In addition to managing the wear leveling activities, XNAND2 ensures that the data stored on the NAND device can be read back accurately. Like all NAND flash management solutions, XNAND2 includes an error correcting code (ECC) that can detect whether an accidental bit flip has occurred. If there are small errors, the ECC algorithm corrects them as the data is read. However, NAND devices, especially the newer lower endurance NAND devices, are easily corrupted, frequently creating more errors than an ECC algorithm can correct. In the worst cases of data corruption, the ECC algorithm can become overwhelmed and incapable of detecting the corruption at all. To make XNAND2 more reliable than the average NAND management solution, Technologic Systems has included additional mechanisms to detect and correct significant data corruption. In addition to storing error correction codes, XNAND2 also stores a checksum and a RAID5-like redundant copy of the data.
Every time data is written to the NAND flash, the ECC bits and the checksum are calculated and stored with the data. An additional calculation also allows a second copy of the data to be retained. The checksum allows the read mechanism to perform a second check of the data in addition to the ECC algorithm. When the ECC algorithm detects errors it cannot correct, or when the corruption is bad enough that it would otherwise escape ECC detection, the checksum is able to catch the error. When the checksum detects a failure, XNAND2 reverses the calculation used to store the backup copy of the data and returns the correct data to the system's read request. This mechanism ensures that correct data is always provided to the higher layers of the system. Once the checksum and data recovery algorithms have successfully recovered the data, the block with the error is marked as a bad block. Bad blocks are not available to the wear leveling algorithm.
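The checksum-then-recover read path can be sketched in Python as follows. Several details here are assumptions for illustration: the source does not say which checksum XNAND2 uses or how its RAID5-like redundancy is computed, so this sketch uses CRC-32 and a single XOR parity block per stripe, omits the ECC layer entirely, and uses hypothetical function names throughout.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    # Byte-wise XOR of equally sized blocks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data_blocks):
    """Store each block with its CRC-32 and compute one parity block for the stripe."""
    stored = [(bytes(block), zlib.crc32(block)) for block in data_blocks]
    parity = xor_blocks(data_blocks)
    return stored, parity


def read_block(stored, parity, index):
    """Return (data, was_recovered); rebuild from parity if the checksum fails."""
    payload, checksum = stored[index]
    if zlib.crc32(payload) == checksum:
        return payload, False
    # Reverse the parity calculation: XOR the other blocks with the parity block.
    others = [p for i, (p, _) in enumerate(stored) if i != index]
    recovered = xor_blocks(others + [parity])
    if zlib.crc32(recovered) != checksum:
        raise IOError("block %d could not be recovered" % index)
    return recovered, True


stored, parity = write_stripe([b"AAAA", b"BBBB", b"CCCC"])
stored[1] = (b"BxBB", stored[1][1])       # simulate corruption that slipped past ECC
print(read_block(stored, parity, 1))      # (b'BBBB', True)
```

In this model the `was_recovered` flag is the caller's cue to retire the block, mirroring the bad block marking described above.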
A read request is not the only time the checksum and RAID5-like redundant data are checked. In addition to detecting errors on reads, the entire system is audited at power on. A resilvering process will start shortly after a system power up which will evaluate all of the data in the addressable storage against the backup copy. Any accumulated errors are proactively corrected with this process. Through checksums, data redundancy, and having an excess of available memory, XNAND2 is able to perform more reliably and longer than other wear-leveling based NAND management solutions.
Despite the industry's reduction in overall SLC NAND flash endurance, Technologic Systems is able to continue to provide a highly reliable embedded system with our XNAND2 solution. In addition to the application level testing described above, Technologic Systems engineers also invested many hours in evaluating the details of the algorithm and testing for corner cases and possible algorithm weaknesses. XNAND2 is a highly reliable algorithm ready for the most aggressive of environments. Technologic Systems is rolling out XNAND2 on the TS-4200, TS-4700, TS-4800, TS-4500, TS-7550, TS-7553, and TS-7558 in early 2017. Documentation for these products will be updated with additional information on using XNAND2, and our support team will be standing by to help our customers with any questions or concerns.
About the Author
Eliza Nelson, Hardware Design Engineer
Eliza is a graduate of the Arizona State University Electrical Engineering and Computer Science programs. She has 14 years of experience in design, field applications engineering, and technical marketing.