Monday, 10 August 2009

UPS replacement

I recently had to replace a computer room UPS. The existing 40 kVA unit was replaced with a new 80 kVA unit. It was a major job, both interesting and a little intimidating due to its magnitude, the criticality of the room to the business and the irreversible nature of the change.



The old UPS on the morning of the replacement


A lot of investigation and planning work went into the job, including:

  • sizing the unit to the building's electrical capability, including the generator, and to the CRAC cooling capacity (we have also upgraded the CRAC units)

  • obtaining good quotes

  • confirming the building wiring requirements

  • checking the floor loading of the raised floor

  • specifying the circuit layout

  • incorporating the existing subboard into the new switchboard

  • planning a total power outage for a multi-storey building and the entire computer room

  • planning the shutdown and restart of the computer room

  • determining how the new unit would be delivered and installed, and how the old unit would be removed

  • confirming the physical dimensions of the unit

  • coordinating our work with major electrical work being done in the building at the same time


I had great assistance with the specification and design from someone in my team and from the building supervisors, both qualified electricians. Three vendors were approached, with two completing a detailed electrical investigation. The accepted quote included the new UPS, all of the electrical work from the basement switchboard, the removal of the old UPS and the installation of the new UPS. The intention of this was to ensure a single point of accountability.



    This is what a 200 amp fuse looks like. It's about 10 cm wide. They should make them look a lot scarier, maybe a big skull on them or just "Stay the hell away!".


    Mains electricity scares the hell out of me; I don't like going anywhere near subboards. You don't know what some dodgy bugger has done in the past.


    Besides doubling the capacity of the old unit, the UPS work provided a number of other enhancements:


    • the new UPS has an obvious Emergency Power Off (EPO) button that, when depressed for a few seconds, will completely drop the power to the racks. I chose not to have a separate wall button because I thought it was unnecessary and potentially problematic; one can be wired in very easily later to the two contacts in the UPS

    • the UPS circuit was put on its own isolation circuit with a 200 amp fuse, so future work on the building electricals can be done without an outage to the room. This was a major change that will ensure availability of the computer services to customers. 200 amp fuses were installed in the main building switchboard for this circuit

    • the new UPS, unlike the previous one, has an SNMP/Web card enabling remote management of the UPS and full alerting capabilities (there's a quick monitoring sketch after this list)

    • a hardware bypass switch was installed for maintenance activities; the previous UPS had an internal bypass only

    • the new switchboard allows the safe installation of additional circuits in the room without a power down to the board and the room
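
    The SNMP/Web card is worth scripting against rather than just watching the web page. Below is a minimal polling sketch, assuming the card answers the standard RFC 1628 UPS-MIB OIDs (many vendor cards use their own MIBs instead) and that the Net-SNMP tools are installed; the IP address and community string are made-up placeholders.

```python
#!/usr/bin/env python3
"""Quick UPS health poll via the SNMP/Web card.

A sketch only: assumes the card exposes the standard RFC 1628 UPS-MIB
and that Net-SNMP's snmpget is on the path. The management IP and
community string below are placeholders, not my real ones.
"""
import subprocess

UPS_IP = "10.0.0.50"     # placeholder management IP of the UPS card
COMMUNITY = "public"     # placeholder read-only community string

# Standard UPS-MIB (RFC 1628) scalar objects
OIDS = {
    "battery status (1=unknown 2=normal 3=low 4=depleted)": "1.3.6.1.2.1.33.1.2.1.0",
    "estimated minutes of runtime remaining": "1.3.6.1.2.1.33.1.2.3.0",
    "estimated charge remaining (%)": "1.3.6.1.2.1.33.1.2.4.0",
}

def snmp_get(oid: str) -> str:
    """Fetch a single OID value using the Net-SNMP command line tools."""
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", UPS_IP, oid],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    for label, oid in OIDS.items():
        print(f"{label}: {snmp_get(oid)}")
```

    Something like this dropped into cron, with thresholds on the runtime and charge figures, covers the basic alerting case even before the vendor's own software is set up.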


    Before the UPS install day the new switchboard and bypass switch cabinet were installed, circuits were run underfloor and the cabling from the basement switchboard was run to the bypass switch. Basically everything that could be done before the outage was done, to minimise work on the day. The underfloor circuits are 32 amp with a Clipsal captive (screw) plug to ensure plugs cannot be accidentally removed or knocked out. On the day the existing powerboards had new plugs fitted to suit the new captive sockets. The intention is to run two independent 32 amp circuits to each of the eleven server racks and two communications cabinets in the room, so a tripped circuit won't take anything down. We don't have the building capacity for redundant UPSes. The existing subboard in the room was wired into the new board, allowing the older circuits, including wall GPOs on UPS, to still be used.
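
    For anyone wondering how two 32 amp feeds per rack squares with an 80 kVA UPS, here is a back-of-envelope comparison. It assumes 230 V single phase circuits, which is my simplification rather than something out of the electrical design.

```python
# Back-of-envelope: connected circuit capacity versus UPS capacity.
# Assumes 230 V single-phase 32 A circuits; actual loading is far lower.
volts = 230
amps_per_circuit = 32
circuits_per_enclosure = 2
enclosures = 11 + 2                      # eleven racks plus two comms cabinets

kva_per_circuit = volts * amps_per_circuit / 1000             # ~7.4 kVA
connected_kva = kva_per_circuit * circuits_per_enclosure * enclosures
print(f"Connected circuit capacity: {connected_kva:.0f} kVA vs 80 kVA UPS")
# Roughly 190 kVA of circuit capacity behind an 80 kVA UPS: the dual feeds
# are for redundancy and headroom, not to be run anywhere near fully loaded.
```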




    1.6 tonnes of new batteries




    Floor protection to get the battery pallets in




    Beefy new Clipsal captive plugs




    The new UPS with open battery cabinet, wiring not completed



    On the day prior to the UPS install, because the entire building power would be down, power leads were run from an adjacent building and floodlights were installed to provide lighting. We had some contingency equipment including:


    • spare leads

    • many spare floodlights

    • several torches

    • a 7 kVA generator to provide power in case there was a problem with the power from the adjacent building or an area wide power outage

    • printed documentation




    Complex insides of the new UPS



    Work started at 4am on the day of the shutdown and UPS install. The shutdown proceeded smoothly and the old UPS was disconnected and removed without incident. We were surprised to find 20 amps being drawn from the old UPS with the entire computer room powered off, and took the opportunity to identify UPS power circuits outside the computer room by individually turning off switchboard circuits and noting the load on the UPS. The non-computer room load included an old communications cabinet in one of our adjacent buildings (!) that used to be a computer room. This really highlighted the importance of documentation and the fragility of relying on people with tacit knowledge of how things are configured.


    The new UPS arrived on time, and consisted of the 500 kg UPS unit, a battery cabinet and two pallets of batteries each weighing 800 kg, for a total of 1.6 tonnes of batteries that will power the UPS on full load for 30 minutes (the room also has automatic generator power). I wouldn't like to have anything less than 30 minutes, to allow some time for a graceful room shutdown in the event of a problem with the generator.
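
    As a rough sanity check on that 30 minute figure (purely illustrative, using an assumed 0.8 power factor and typical lead-acid numbers rather than anything from the vendor's spec sheet):

```python
# Rough sanity check of the quoted 30 minute full-load runtime.
# The power factor and energy density are my assumptions, not vendor data.
ups_kva = 80
power_factor = 0.8                                  # assumed
full_load_kw = ups_kva * power_factor               # ~64 kW
runtime_hours = 0.5
energy_needed_kwh = full_load_kw * runtime_hours    # ~32 kWh delivered

battery_mass_kg = 1600
implied_wh_per_kg = energy_needed_kwh * 1000 / battery_mass_kg
print(f"Implied usable energy density: {implied_wh_per_kg:.0f} Wh/kg")
# Around 20 Wh/kg, which is plausible for sealed lead-acid discharged hard
# over half an hour, so the 30 minute claim passes the smell test.
```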


    The new UPS was shipped, and the old UPS removed, by Hi-Tech Express, a firm specialising in moving sensitive technology equipment. They did a great job and I have no hesitation in recommending them. To avoid the really heavy pallets damaging the floor, metal plates were used to protect the tiles and carpets. To help with the back-breaking task of putting in the batteries, a small lift was used.


    Work proceeded on schedule until the battery cabinet went in. Installing the dozens of metal bus bars that connect the battery terminals together took an extraordinary amount of time. So too did the actual wiring of the UPS into the mains supply. During the installation, and after the wiring of the new UPS was completed, extensive testing of the cabling was conducted. The commissioning of the UPS itself took over an hour and involved testing the output of the UPS.


    We were only thirty minutes behind schedule when the UPS was finally ready to supply power to the room. Unfortunately we then encountered a problem that delayed the room startup, but eventually the room was powered on and operational on the new UPS.


    I learnt a lot on this project. Doing any major electrical work in an existing building is complex, potentially problematic and highly dependent on tacit knowledge if documentation is poor.


    Tuesday, 6 May 2008

    SAN SP Problem

    Going to start blogging work. Might help someone Googling.

    Had a weird problem today with one of the Dell-badged EMC Clariion CX-500 SANs I look after. I have two identical SANs, one at the main office and another 5 kilometres away, linked via our own single mode fibre, to which we replicate some LUNs. One of the storage processors (SP) in the main SAN became "unmanaged", dropping out of the Navisphere management console, and the management IP address was not pingable. This broke a number of mirrors using that SP. With several terabytes of mission critical data, SAN errors are something I don't like to see! The problem started at 5am, so I didn't think it was a user-initiated screwup or change - not even I am working that late! I checked the obvious: flashing NIC activity lights on the SP NIC, replaced cabling, used a different switch port and checked that someone hadn't changed the VLAN or port speed of the switch port the SP was plugged into. I also tried pinging from the same subnet in case it was a default gateway problem on the SP. All ok. I then tried plugging a laptop with a crossover cable directly into the SP NIC on the same subnet; the management IP was still not reachable.

    Looking at the SP, there is an RJ-45 console port which helpfully ISN'T actually a normal console port, but a port that requires you to establish a PPP dial-up networking connection to it. After establishing a PPP network connection and accessing the HTTP setup page (http://192.168.1.1/setup), the setup page displayed the correct IP address of the SP. So it seemed to have the correct setup info; after committing these settings the SP came back up. So basically the SP had somehow lost its management IP settings, or needed them to be reset. Nice to have it fixed, but still not sure why it happened.
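
    Since it could happen again without warning, a trivial reachability probe of the SP management interfaces is cheap insurance. A minimal sketch along these lines (the IP addresses are placeholders, and it only tests that each SP answers on its management web port, nothing deeper):

```python
#!/usr/bin/env python3
"""Tiny reachability check for the Clariion SP management interfaces.

A sketch only: the IP addresses are placeholders, and all it does is try
a TCP connection to each SP's management web port, which is roughly what
"is it still manageable?" came down to in this incident.
"""
import socket

# Placeholder management IPs for SP A and SP B on each array
SPS = {
    "main-SPA": "10.0.1.10",
    "main-SPB": "10.0.1.11",
    "remote-SPA": "10.0.2.10",
    "remote-SPB": "10.0.2.11",
}
PORT = 80        # Navisphere management is web based; adjust if using HTTPS
TIMEOUT = 5      # seconds

for name, ip in SPS.items():
    try:
        with socket.create_connection((ip, PORT), timeout=TIMEOUT):
            print(f"{name} ({ip}): reachable")
    except OSError as exc:
        print(f"{name} ({ip}): NOT reachable - {exc}")
```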


    Sunday, 16 December 2007

    ESX Server: Timed out

    Putting this up as I couldn't find it via Google or the VMware forums when I encountered this problem:

    I have ESX Server 3 and Virtual Centre 2, and powered down a virtual machine that ran one of our DNS servers. I have DNS running on both physical and virtual servers. I then found I was unable to start that virtual machine, or any other virtual machine, getting a "Timed out waiting for the server response" error in the Virtual Infrastructure Client. This is despite the fact that other non-virtual DNS servers were defined on the ESX and Virtual Centre servers. Name resolution on the Virtual Centre server seemed ok; I flushed the DNS cache and name resolution then checked out ok. I also tried, unsuccessfully, changing the default DNS server on the Virtual Centre server to a physical DNS server. After closing the Virtual Infrastructure Client I found I was unable to reopen the client and log on to the Virtual Centre server.

    Not being able to start VMs was very disconcerting. To fix this I opened the Virtual Infrastructure Client and logged on directly to the ESX server hosting the virtual machine I had powered down, rather than to the Virtual Centre server. I then successfully restarted the virtual machine running the DNS server. After this server was restarted, I restarted the Virtual Centre server, then reopened the Virtual Infrastructure Client and logged on successfully to the Virtual Centre server. All working ok again. I then changed the default DNS server on the Virtual Centre server to a physical DNS server to try to avoid a recurrence.
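
    A cheap guard before powering off a DNS VM in future is to check that the Virtual Centre and ESX host names still resolve against every remaining DNS server individually. A minimal sketch using the third-party dnspython library; the host names and server addresses are hypothetical.

```python
#!/usr/bin/env python3
"""Check that the virtualisation hosts resolve against each DNS server.

A sketch only, using the third-party dnspython library. The host names
and DNS server addresses below are hypothetical, substitute your own.
"""
import dns.resolver  # pip install dnspython

HOSTS = ["virtualcentre.example.local", "esx01.example.local"]  # hypothetical
DNS_SERVERS = ["10.0.0.1", "10.0.0.2"]                          # hypothetical

for server in DNS_SERVERS:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    resolver.lifetime = 3  # seconds to wait before calling it a failure
    for host in HOSTS:
        try:
            answer = resolver.resolve(host, "A")
            addresses = ", ".join(rr.to_text() for rr in answer)
            print(f"{server}: {host} -> {addresses}")
        except Exception as exc:
            print(f"{server}: {host} FAILED ({exc})")
```

    If any combination fails, fix DNS first; as I found out, the "Timed out" error in the client gives no hint that name resolution is the underlying problem.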


    Saturday, 10 November 2007

    Messy Cabling

    Just wanted to share my collection of messy cabling pictures sourced from different sites around the Internet.

    My racks are not this bad!

    Gotta love the devices hanging down:














    Love the cooling fan in this one, but at least they've chained the cable distribution record book to the frame!:





    Doesn't appear to be a single cable management rail in this one; the big hanging roll of unused orange cable is a great touch, as is the use of double adaptors:




    This one is my favourite:

    Links: www.ratemynetworkdiagram.com
