While these machines still constitute servers in a raw sense, it would take a brave Technology Officer to put their faith in these white boxes to fulfil the IT requirements of their company. This guide demonstrates what differentiates business class servers from the typical white box server that you can build from off the shelf components, and highlights some of the many factors of a server's design that need to be carefully considered in order to provide reliable services for business.
Form Factor
Servers come in all shapes and sizes. The tower server is designed for organisations or branch offices whose entire infrastructure consists of a server or two. From the outside, they wouldn’t look out of place on or under someone’s desk but the components that make up the server’s guts are often of a higher build quality than workstation components. Tower cases are generally designed to minimise cost whilst providing smaller businesses some sense of familiarity with the design of the enclosure.
For larger server infrastructures, the rack mount case is used to hold a server's components. As the name suggests, rack mount servers are almost always installed within racks and located in dedicated data rooms, where power supply, physical access, temperature and humidity (among other things) can be closely monitored. Rack mount servers come in standard sizes: they are 19 inches in width and have heights in multiples of 1.75 inches, where each multiple is 1 Rack Unit (RU). They are often designed with flexibility and manageability in mind.
Lastly, the blade server is designed for dense server deployment scenarios. A blade chassis provides the base power, management, networking and cooling infrastructure for numerous, space efficient servers. Most of the top 500 supercomputers these days are made up of clusters of blade servers in large data centre environments.
Processors
With the proliferation of quad core processors in the mainstream performance sector of today's computing landscape, the main difference between servers and workstations that you will see comes down to support for multiple sockets. Consumer class Core 2 and Phenom based systems are built around single socket designs that feature multiple cores per socket and cannot be used in multi socket configurations. Xeon and Opteron processors, on the other hand, provide interconnects that allow processes to be scheduled across multiple separate processors, each featuring multiple cores that contribute towards the total processing power of a server. It's not uncommon to see quad socket, quad core configurations in some high end servers, providing a total of 16 processing cores at upwards of 3.0GHz per core. The scary thing is that six core and eight core processors are just around the corner...
The other main difference that you see between consumer and enterprise processors is the amount of cache that is provided. Xeon and Opteron processors often have significantly larger Level 2 and Level 3 caches in order to reduce the amount of data that has to be shifted to memory, generally resulting in slightly faster computation times depending on the application. A server’s form factor will also have an impact on the type of processor that can be used. For instance, blade servers often require more power efficient, cooler processors due to their increased deployment density. Similarly, a 4RU server may be able to run faster and hotter processors than a 1RU server from the same vendor.
Memory
While the physical RAM modules that you see in today's servers don't differ dramatically from consumer parts, there are numerous subtle differences to the memory subsystems that provide additional fault tolerance features. Most memory controllers feature Error Checking and Correction (ECC) capabilities, and the RAM modules installed in such servers need to support this feature. Essentially, ECC capable memory performs a quick parity check before and after each read or write operation to verify that the contents of memory have been read or written properly. This feature minimises the likelihood of memory corruption due to a faulty read or write operation.
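To get a feel for the idea, here's a minimal sketch of parity-based error detection. Note that this is a deliberate simplification: real ECC memory uses Hamming-style SECDED codes that can also correct single-bit errors, not just detect them, and all of this happens in the memory controller hardware rather than in software.

```python
# Simplified illustration of parity-based error detection.
# Real ECC memory uses Hamming-style SECDED codes that can also
# *correct* single-bit errors; this sketch only *detects* them.

def parity_bit(word: int) -> int:
    """Even parity: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") % 2

def store(word: int):
    """'Write' a word to memory alongside its computed parity bit."""
    return word, parity_bit(word)

def load(word: int, stored_parity: int) -> int:
    """'Read' a word back, raising if the parity no longer matches."""
    if parity_bit(word) != stored_parity:
        raise ValueError("memory error detected")
    return word

word, p = store(0b10110010)
assert load(word, p) == 0b10110010      # clean read passes the check
corrupted = word ^ 0b00000100           # flip one bit "in transit"
try:
    load(corrupted, p)
except ValueError:
    print("single-bit error detected")
```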
The other main difference in memory controller design is how much RAM is supported. Intel based servers are about to start utilising a memory controller that is built on to the processor die, as has been the case with AMD based systems for years. Even the newest mainstream memory controllers support a maximum of 16GB of RAM, whereas HP has recently announced a "virtualisation ready" Nehalem based server design, available by year's end, that will support 128GB of RAM. Many modern servers also provide mirrored memory features. A memory mirror essentially provides RAID 1 functionality for RAM: the contents of your system memory are written to two separate banks of identical RAM modules. If one bank develops a fault, it is taken offline and the second bank is used exclusively. The memory controller of the server can usually handle this failover without the operating system even being aware of the change, preventing unscheduled downtime of the server.
Hot spare memory can also be installed in a bank of some servers. The idea here is that if the memory in one bank is determined to be faulty, the hot spare bank can be brought online and used in place of the faulty bank. In this scenario, some memory corruption can occur depending on the operating system and memory controller combination in use. The worst case scenario here usually involves a crash of the server, followed by an automated reboot by server recovery mechanisms (detailed later in this article). Upon reboot, the memory controller brings the hot spare RAM online, limiting downtime. Hot swappable memory is often used in conjunction with both of these features, giving you the ability to swap out faulty RAM modules without having to shut down the entire server.
Storage Controllers
Drive controllers are dramatically different in servers. Forget on board firmware based SATA RAID controllers that provide RAID 0, 1 and 1+0 and consume CPU cycles every time data is read from or written to the array. Server class controllers have dedicated application specific integrated circuits (ASICs) and a bucket full of cache (sometimes as much as 512MB) in order to boost the performance of the storage subsystem. These controllers also frequently support advanced RAID levels including RAID 5 and 6.
The controller cache can be one of the most critical components of a server, depending on the application. At my place of employment, we have a large number of servers that capture video in HD quality in real time. A separate "ingest" server often pulls this data from the encode server immediately after it has been captured for further processing and transcoding. Having 512MB of cache installed on the drive controller allows data to be pushed out via the network interface before it has been physically written to disk, significantly boosting performance. Testing has revealed that if we reduced the cache size to 64MB, data has to be physically written to disk and then physically read when the ingest process takes place, placing significant additional load on the server. Finally, consider that most mainstream controllers have no cache whatsoever; the impact on performance in this scenario would probably prevent us from working with HD quality content altogether.
But what happens if there is a power outage and the data that is in the controller cache has not yet been written to the disk? In order to prevent data loss, some controllers feature battery backup units (BBUs) that are capable of keeping the contents of the disk cache intact for in excess of 48 hours, or until power is restored to the server. Once the server is switched on again, the controller commits the data from the cache to the disk array before flushing the cache and continuing with the boot process. No data is lost. BBUs are another feature missing from mainstream controllers.
The problem with RAID 5
Traditionally, RAID 5 has been the holy grail of disk arrays, providing the best compromise between performance and fault tolerance. However with the continual increase in storage density, RAID 5 is starting to exhibit a significant design flaw when the array has to be rebuilt after a disk failure.
RAID 5 arrays can tolerate the failure of a single drive in the array. If during the time that it takes to replace the faulty drive and rebuild the array, a second drive fails or an unrecoverable read error (URE) occurs on one of the surviving drives in the array, the rebuild will fail and all data on the array will be lost.
Most manufacturers will quote the probability of encountering a URE in the detailed specifications sheet for each drive. Most consumer grade products have a quoted URE rate of ~1 in 10^14, which translates to an average of 1 URE encountered for every 12TB of data read. Now, imagine that you have a RAID 5 array containing four 1.5TB drives (which are now readily available) and one disk goes pear shaped. You replace the faulty drive, the rebuild process begins and 1.5TB of data is read from each remaining drive in order to rebuild the data on the new disk. Assuming that you have "average" drives, there's around a 33% chance of encountering a URE while rebuilding the array, which would result in the loss of up to 4.5TB of data.
Back in the days when we were dealing with arrays containing five 32GB disks, the probability of a URE occurring during array rebuilds was minuscule. But nowadays, it's not uncommon to see array configurations exceeding 2TB in size, containing eight or more large capacity drives. As a result of the increased number of drives and the increasing capacity of those drives, the probability of encountering a URE during the rebuild process is approaching the stage where RAID 5 arrays are unlikely to be successfully rebuilt in the event of a drive failure. And the more large capacity drives you use in an array, the more likely a URE will occur during the rebuild.
RAID 6 is the solution that is commonly used to overcome the limitations of RAID 5. RAID 6 utilises two different parity schemes and distributes these parity blocks across drives in much the same manner as RAID 5 does. The use of two separate parity schemes essentially allows two drives in an array to fail while maintaining data integrity. While RAID 5 requires n+1 drives in the array, RAID 6 requires n+2 so you’ll be assigning the capacity of two whole drives to parity instead of one.
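The single-parity scheme that RAID 5 relies on can be sketched in a few lines. Parity is just the XOR of the data blocks in a stripe, so any one missing block can be recovered by XOR-ing everything that survives. This is an illustrative toy only: real controllers do this in ASICs at the stripe level, and RAID 6's second parity is a different (typically Reed-Solomon based) computation that this sketch does not attempt.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, as a RAID parity engine would."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Three data blocks from one stripe, plus the parity block the
# controller would compute when the stripe is written.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Drive holding data[1] fails: its block is reconstructed by
# XOR-ing the surviving data blocks with the parity block.
surviving = [data[0], data[2], parity]
rebuilt = xor_blocks(surviving)
assert rebuilt == data[1]
print("reconstructed:", rebuilt)   # prints: reconstructed: b'BBBB'
```

Because parity is a pure XOR, reconstruction requires reading every surviving block in the stripe, which is exactly why a URE anywhere during a full-array rebuild is fatal to RAID 5.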
If the server that you’re building does not require a large amount of disk space, RAID 5 may be perfectly acceptable. However, if you’re deploying a large number of drives or large capacity drives in your server, you’ll want to ensure that you have a drive controller that supports RAID 6.
It should also be noted that while RAID 6 overcomes the issues that are starting to become prominent with RAID 5, a few years from now RAID 6 will exhibit the same problem if used with larger arrays and drives of larger capacities than we have today. But until that day comes, RAID 6 remains a more reliable fault tolerance scheme than RAID 5.
Maths
Regardless of the scenario, we assume that all 1.5TB needs to be read from every surviving drive in the array in order to perform a successful rebuild. This gives us a 12.5% probability of encountering a URE on a single drive (1.5 / 12 = 0.125), and an 87.5% probability of not encountering one (1 - 0.125 = 0.875).
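These figures can be worked through directly. The sketch below uses the same simplified model as above (independent UREs, every surviving drive read in full) and treats RAID 6's second parity as able to absorb a URE on one drive during the rebuild, so the rebuild only fails if two or more surviving drives each hit a URE. That RAID 6 treatment is my simplifying assumption for illustration, not a controller specification.

```python
from math import comb

P_URE_PER_DRIVE = 1.5 / 12        # 1.5TB read per drive, ~1 URE per 12TB
p_ok = 1 - P_URE_PER_DRIVE        # 0.875: one drive survives the rebuild read
survivors = 3                     # four-drive array, one drive already failed

# RAID 5: a URE on *any* surviving drive kills the rebuild.
raid5_success = p_ok ** survivors

# RAID 6 (simplified model): the second parity covers a URE on one
# drive, so the rebuild only fails if two or more drives hit a URE.
raid6_success = sum(
    comb(survivors, k) * P_URE_PER_DRIVE**k * p_ok**(survivors - k)
    for k in (0, 1)
)

print(f"RAID 5 rebuild success: {raid5_success:.1%}")   # ~67.0%
print(f"RAID 6 rebuild success: {raid6_success:.1%}")   # ~95.7%
```

Even under this crude model the gap is stark: roughly one in three RAID 5 rebuilds of this array would fail, against about one in twenty for RAID 6.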
Working through the numbers, you're much more likely to achieve a successful rebuild with a RAID 6 array; however, even this probability of success is lower than some would desire. This only reinforces the fact that RAID 6 is significantly better than RAID 5, but it will also experience the same issues in time, assuming that URE rates don't improve as disk capacities grow.
And on a side note, I was the unfortunate victim of a rebuild failure due to a URE about a year ago, when I accidentally knocked a power cord out of a seven drive RAID 5 NAS enclosure stocked with 250GB disks (the enclosure was four years old and did not support RAID 6, but we did have it configured with one hot spare drive). Knocking the cable out abruptly killed one of the redundant power supplies, which took one of the drives with it. The hot spare drive was immediately activated and the array began to rebuild. About 5 hours into the rebuild, a URE occurred and the rebuild failed.
It's just as well we had that 1.5TB worth of data backed up on to a second array as well as LTO tape; this just goes to show that RAID arrays are not the be all and end all of fault tolerance.
External Storage
Any computer chassis has a physical limit to the number of drives that you can install. This limitation is overcome in enterprise servers by connecting to Storage Area Networks (SANs), typically via one of two interfaces: Fibre Channel or iSCSI.
iSCSI is generally the cheaper option of the two because data transferred between the SAN and server is encapsulated in frames sent over ubiquitous Ethernet networks, meaning that existing Ethernet interfaces, cabling and switches can be used (aside from the cost of the SAN enclosure itself, the only additional costs are generally an Ethernet interface module for the SAN and software licenses).
On the other hand, fibre channel requires its own fibre optic interfaces, cabling and switches, which significantly drives up cost. However, having a dedicated fibre network means that bandwidth isn’t shared with other Ethernet applications. Fibre channel presently offers interface speeds of 4Gb/s compared to the 1Gb/s often seen in most enterprise networks. Fibre channel also has less overhead than Ethernet, which provides an additional boost to comparative performance.
Disk Drives
For years, enterprise servers have utilised SCSI hard disk drives instead of ATA variants. SCSI allowed for up to 15 drives on a single parallel channel versus the 2 on a PATA interface; PATA drives ship with the drive electronics (the circuitry that physically controls the drive) integrated on the drive (IDE), whereas SCSI controllers performed this function in a more efficient manner; many SCSI interfaces provided support for drive hot swapping, reducing downtime in the event of a drive failure; and the SCSI interface allowed for faster data transfer rates than what could be obtained via PATA, giving better performance, especially in RAID configurations.
However, over the last year, Serial Attached SCSI (SAS) drives have all but superseded SCSI in the server space in much the same way that SATA drives have replaced their PATA brethren. The biggest problem with the parallel interface was synchronising clock rates on the many parallel connections; serial connections don't require this synchronisation, allowing clock rates to be ramped up and increasing bandwidth on the interface.
SAS drives are still the same as SCSI drives in many ways: the SAS controller is still responsible for issuing commands to the drive (there is no IDE), SAS drives are hot swappable, and data transfer over the interface is faster compared to SATA. SAS drives come in both 2.5 and 3.5 inch form factors, with the 2.5 inch size proving popular in servers as they can be installed vertically in a 2RU enclosure.
In addition, SAS controllers can support 128 directly attached devices on a single controller, or in excess of 16,384 devices when the maximum of 128 port expanders are in use (however, the maximum amount of bandwidth that all devices connected to a port expander can use equals the amount of bandwidth between the controller and the port expander). In order to support this many devices, SAS also uses higher signal voltages in comparison to SATA, which allows the use of 8m cables between controller and device. Without those higher signal voltages, I'd like to see anyone connect 16,384 devices to a disk controller with a maximum cable length of 1 metre (the current SATA limitation).
In the next few months, there will be another major advantage to using SAS over SATA in servers: multipath I/O. Suitable dual port SAS drives can connect to multiple controllers within a server, which provides additional redundancy in the event of a controller failure.
GPUs and Video
One of the areas where enterprise servers are inferior to regular PCs is in the area of graphics acceleration. Personally, I’m yet to see a server that has been installed within a data centre that contains a PCI Express graphics adapter but that’s not to say that it’s not possible to install one in an enterprise server. In general though, most administrators find the on board adapters more than adequate for server operations.
Networking
Modern day desktops and laptops feature Gigabit Ethernet adapters, and the base adapters seen on servers are generally no different. However, like most other components in servers, there are a few subtle differences that improve performance in certain scenarios.
In order to provide network fault tolerance, two or more network adapters are integrated on most server boards. In most cases, these adapters can be teamed. Like RAID fault tolerance schemes, there are numerous types of network fault tolerance options available, including:
• Network Fault Tolerance (NFT): In this configuration, only one network interface is active at any given time, while the rest remain in a slave mode. If the link to the active interface is severed, a slave interface is promoted to become the active one. Provides fault tolerance, but does not aggregate bandwidth.
• Transmit Load Balancing (TLB): Similar to NFT, but slave interfaces are capable of transmitting data provided that all interfaces are in the same broadcast domain. This aggregates transmit bandwidth (but not receive) and also provides fault tolerance.
• Switch assisted Load Balancing (SLB) and 802.3ad Dynamic: Provides aggregation of both transmit and receive bandwidth across all interfaces within the team, provided that all interfaces are connected to the same switch. Provides fault tolerance on the server side (however, if the switch that is connected to the server fails, you have an outage). 802.3ad Dynamic requires a switch that supports the 802.3ad Link Aggregation Control Protocol (LACP) in order to dynamically create teams, whereas SLB must be manually configured on both the server and the switch.
• 802.3ad Dynamic Dual Channel: Provides aggregation of both transmit and receive bandwidth across all interfaces within the team and can span multiple switches, provided that they are all in the same broadcast domain and that all switches support LACP.
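The simplest of these schemes, NFT, can be sketched as a toy model: one active interface, the rest standing by, with a slave promoted the moment the active link drops. The class and interface names here are made up for illustration; real teaming is implemented in the NIC driver, not in application code.

```python
class NftTeam:
    """Toy model of Network Fault Tolerance (NFT) teaming: one active
    interface, the rest standing by; traffic fails over on link loss.
    Illustrative only - real teaming lives in the NIC driver."""

    def __init__(self, interfaces):
        self.link_up = {nic: True for nic in interfaces}
        self.active = interfaces[0]

    def link_down(self, nic):
        """Simulate a severed link; promote a healthy slave if needed."""
        self.link_up[nic] = False
        if nic == self.active:
            self.active = next(n for n, up in self.link_up.items() if up)

    def send(self, frame):
        """All traffic leaves via whichever interface is currently active."""
        return f"{frame} via {self.active}"

team = NftTeam(["eth0", "eth1"])
assert team.send("ping") == "ping via eth0"
team.link_down("eth0")              # cable pulled on the active NIC
assert team.send("ping") == "ping via eth1"
```

Note that, exactly as described above, the team never uses more than one interface's worth of bandwidth: the slaves exist purely for fault tolerance.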
Just about all server network interface cards (NICs) support Virtual Local Area Network (VLAN) trunking. Imagine that you have two separate networks: an internal one that connects to all devices on your LAN, and an external one that connects to the Internet, with a router in between. In conventional networks, the router needs to have at least two network interfaces, one dedicated to each physical network.
Provided that your network equipment and router supports VLAN trunking, your two networks could be set up as separate VLANs. In general, your switch would keep track of which port is connected to which VLAN (this is known as a port based VLAN), and your router is trunked across both VLANs utilising a single NIC (physically, it becomes a router on a stick). Frames sent between the switch and router are tagged so that each device knows which network the frame came from or is destined to go to.
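The tagging itself is a small, well-defined operation: under 802.1Q, a 4-byte tag is inserted into the Ethernet header after the destination and source MAC addresses. The sketch below builds such a tag by hand purely to show where it sits in the frame; the MAC addresses and payload are placeholder values.

```python
import struct

def tag_frame(frame: bytes, vlan_id: int, priority: int = 0) -> bytes:
    """Insert an 802.1Q tag after the destination and source MACs.
    TPID 0x8100 identifies the frame as tagged; the TCI carries the
    priority bits and the 12-bit VLAN ID."""
    tci = (priority << 13) | (vlan_id & 0x0FFF)
    tag = struct.pack("!HH", 0x8100, tci)
    return frame[:12] + tag + frame[12:]

# A minimal untagged frame: dst MAC, src MAC, EtherType (IPv4), payload.
untagged = bytes(6) + bytes(6) + b"\x08\x00" + b"payload"
tagged = tag_frame(untagged, vlan_id=10)

assert tagged[12:14] == b"\x81\x00"                        # TPID present
assert (int.from_bytes(tagged[14:16], "big") & 0x0FFF) == 10  # VLAN 10
```

The 12-bit VLAN ID field is what limits a trunk to 4094 usable VLANs, and the switch and router both read this tag to decide which logical network each frame belongs to.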
VLANs behave in the same manner as physical LANs, but network reconfigurations can be made in software as opposed to forcing a network administrator to physically move equipment.
Because of the sheer amount of data that is received on Gigabit and Ten Gigabit interfaces, it can become expensive to send every Ethernet frame to the CPU in order for it to process TCP headers. It takes roughly 1GHz of processor power to transmit TCP data at Gigabit Ethernet speeds.
As a result, TCP Offload Engines are often incorporated into server network adapters. These integrated circuits process TCP headers on the interface itself instead of pushing each frame off to the CPU for processing. This has a pronounced effect on overall server performance in two ways: not only does the CPU benefit from not having to process this TCP data, but less data is transmitted across the PCI Express lanes toward the Northbridge of the server. Essentially, TCP Offload Engines free up resources in the server so that they can be assigned to other data transfer and processing needs.
The final difference that you see between server NICs and consumer ones is that the buffers on enterprise grade cards are usually larger. Part of the reason for this is due to the additional features mentioned above, but there is also a small performance benefit to be gained in some scenarios (particularly inter VLAN routing).
Power Supplies
One of the great features about ATX power supplies is the standards that must be adhered to. ATX power supplies are always the same form factor and feature the same types of connectors (even if the number of those connectors can vary). But while having eight 12 volt Molex connectors is great in a desktop system, this amount of connectors is generally not required in a server, and the cable clutter could cause cooling problems.
Power distribution within a server is well thought out by server manufacturers. Drives are typically powered via a backplane instead of individual Molex connectors and fans often drop directly into plugs on the mainboard. Everything else that requires power draws it from other plugs on the mainboard. Even the power supplies themselves have PCB based connectors on them. All of this is designed to help with the hot swapping of components in order to minimise downtime.
Most servers are capable of housing redundant power supplies. The first advantage here is that if one power supply fails, the redundant supply can still deliver enough juice to keep the server running. Once aware of the failure, you can then generally replace the failed supply while the server is still running.
The second advantage requires facility support. Many data centres will supply customer racks with power feeds on two separate circuits (which are usually connected to isolated power sources). Having redundant power supplies allows you to connect each supply up to a different power source. If power is cut to one circuit, your server remains online because it can still be powered by the redundant circuit.
Server Management
Most servers support Intelligent Platform Management Interfaces (IPMIs), which allow administrators to manage aspects of the server and to monitor server health, even when the server is powered off.
For example, say that you have a remote Linux server that encountered a kernel panic: you could access the IPMI on the server and initiate a reboot, instead of having to venture down to the data centre, gain access and press the power button yourself. Alternatively, say that your server is regularly switching itself on and off every couple of minutes, too short a time for you to log in and perform any kind of troubleshooting. By accessing the IPMI, you could quickly determine that a fan tray has failed and the server is automatically shutting down once temperature thresholds are exceeded. These are two of the most memorable scenarios where having access to IPMIs has saved my skin.
Many servers also incorporate Watchdog timers. These devices perform regular checks on whether the Operating System on the server is responding and will reboot the server if the response time is greater than a defined threshold (usually 10 minutes). These devices can often minimise downtime in the event of a kernel panic or blue screen.
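The watchdog pattern is simple enough to sketch as a toy model: the operating system must "pet" the timer at regular intervals, and if it goes quiet for longer than the threshold, the watchdog forces a reset. A real watchdog is a hardware timer that pulls the reset line itself; the threshold here is shortened to fractions of a second purely for illustration.

```python
import time

class Watchdog:
    """Toy model of a hardware watchdog timer: the OS must 'pet' the
    timer regularly; if it goes quiet past the threshold, the server
    is rebooted. A real watchdog pulls the reset line in hardware."""

    def __init__(self, threshold_s: float):
        self.threshold = threshold_s
        self.last_pet = time.monotonic()
        self.rebooted = False

    def pet(self):
        """Called periodically by a healthy operating system."""
        self.last_pet = time.monotonic()

    def check(self):
        """The watchdog's own periodic expiry check."""
        if time.monotonic() - self.last_pet > self.threshold:
            self.rebooted = True

wd = Watchdog(threshold_s=0.05)
wd.pet()
wd.check()
assert not wd.rebooted        # the OS responded in time
time.sleep(0.1)               # simulate a hung kernel: no more petting
wd.check()
assert wd.rebooted            # watchdog forces the reboot
```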
Finally, most server vendors will also supply additional Simple Network Management Protocol (SNMP) agents and software that allow administrators to monitor and manage their servers more closely. The agents that are supplied provide just about every detail about the installed hardware that you could ever want to know: how long a given hard disk drive has been operating in the server, the temperature within a power supply, or how many read errors have occurred in a particular stick of RAM. All of this data can be polled and retrieved with an SNMP management application (even if your server vendor doesn't supply you with one of these, there are dozens of GPL packages available that utilise the Net-SNMP project).
The future...
All of the points detailed in this article and within the corresponding article on the APC website highlight the differences that are seen between today’s high end consumer gear (which is typically used to make the DIY server) and enterprise level kit. However, emerging technologies will continue to have an impact on both the enterprise and consumer markets.
As the technology becomes more refined, solid state drives (SSDs) will start to emerge as a serious alternative to SAS hard disk drives for some server applications. Initially, they'll most likely be deployed where lower disk capacities and lower access times are required (such as database servers). As the capacity of these drives increases, they'll become more prominent, but they will probably never replace the hard disk drive for storing large amounts of data.
The other big advantage of using SSDs is that the RAID 5 issue mentioned earlier becomes less of a problem. SSDs shouldn't exhibit UREs: once data is written to the drive, it's stored physically, not magnetically. A good SSD will also verify the contents of a block, including whether it can be read, before the write operation is deemed to have succeeded. Thus, if the drive can't write to a specific block, that block should be marked as bad and a reallocation block brought online to take its place. Your SNMP agents can then inform you when the drive starts using up its reallocation blocks, indicating that a drive failure will soon occur. In other words, you'll be able to predict when an SSD will fail with more certainty, which could give RAID 5 a new lease of life.
Moving further forward, the other major break from convention in server hardware will most likely be a move toward the use of more application specific processing units instead of the CPU as we know it today. There's already some movement in this area: Intel's Larrabee is an upcoming example of a CPU/GPU hybrid, and the Cell Broadband Engine Architecture (otherwise known as the Cell architecture) that is used in Sony's PlayStation 3 is also used in the IBM Roadrunner supercomputer (the first to sustain performance over the 1 petaFLOPS mark).