(no subject)
[Jun. 25th, 2009|02:18 pm]
[ Current Mood | amused ]
We have a "backup" system that consists of syncing files off to remote storage at work. It's a temporary workaround while we're waiting for a real backup system out of procurement. There is already a procedure to do the backups. (Run a script.) I was asked to detail the restore procedure.
I wonder if anyone will notice the extra bits. Or if it will pass the chuckle test.
Manual System Restore Procedure:
0. Tools
The following tools should be at hand for recovery:
- OS install media of a matching version of the system to restore
- Knoppix CD
- Clonezilla CD (version compatible with hardware)
- Stress Linux CD
- Alcohol, preferably scotch or rum.
Optional tools:
- 3 lb engineering hammer (5+ lb sledge hammer is an acceptable substitute)
- LFS live CD
- Soda water and other mixers
- rkhunter / chkrootkit CD
- Doritos (cool ranch)
- Video game system
1. Prepare base system
You want to install the former files onto a system that resembles the old one as closely as possible. The more variances in hardware or system layout, the more complicated it is to return the system to full functionality.
The goal here is to initially bootstrap a similar revision of the OS that you will be restoring.
If a system has only partial loss, and can be brought up in single user mode, then that is sufficient. If possible, run stresslinux on the system for at least an hour (preferably a day) to confirm the hardware is truly not at fault.
If the system is a complete loss (i.e., dead drive and no OS on the new drive), it is preferable to use clonezilla to clone a peer in the same environment. If no peers are available, install a minimal-package install of the OS of the same brand and major revision onto the system.
During this step application of the toolset "alcohol" with optional usage of the toolset "Doritos" and toolset "video game system" is appropriate.
2. Assessing the state of the backup
Files that have been backed up are stored in the archive log volume on the DotHill logserver virtual server. They can be reviewed on [machine name redacted] in "/archive_logs/images-rsync". Each directory is named for the hostname of the backed up system, so "Frogger" would be in "/archive_logs/images-rsync/frogger". If you have trouble understanding that previous sentence, apply the toolset alcohol until clarity is achieved.
Within each host directory there are two sub directories and one or more files:
- "main-backup" – This directory is the location of the most recent files that were backed up. Most of our work will copy from here.
- "increments" – As multiple backups are performed, any files that changed or were deleted have their former versions stored here in dated subdirectories. This is a handy location to grab any "accidentally" deleted files on a system that is backed up regularly. This usually accompanies an opportunity to explain that "rm" is not the "remark" command by which you annotate files on a system.
- "bootsector.?da.dd" – These files represent the 512-byte boot sector of the boot device for the system. They can be dd'ed back onto the raw device when some bright person overwrites the boot loader. The optional toolset "engineering hammer" is appropriate for the RCA discussion with the user who did this.
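For the boot sector files, here is a minimal sketch of the dd step, assuming an MBR-partitioned disk. On that layout the first 446 bytes of the sector are boot loader code and the remaining 66 bytes hold the partition table and signature, so writing only the first 446 bytes leaves the new drive's partition table alone. The function name restore_bootcode is made up for this example, and the device name is whatever your boot device actually is.

```shell
# restore_bootcode IMAGE DEVICE   (hypothetical helper, not part of the backup script)
# Copies only the first 446 bytes (boot loader code) of a saved 512-byte
# boot sector image onto DEVICE. Bytes 446-511 of DEVICE (the MBR
# partition table and signature) are left untouched.
restore_bootcode() {
    image=$1
    device=$2
    # Refuse to proceed unless the image is exactly one 512-byte sector.
    size=$(wc -c < "$image" | tr -d ' ')
    if [ "$size" -ne 512 ]; then
        echo "refusing: $image is $size bytes, expected 512" >&2
        return 1
    fi
    # conv=notrunc stops dd from truncating DEVICE when it is a regular
    # file (for example a disk image used for a rehearsal).
    dd if="$image" of="$device" bs=446 count=1 conv=notrunc 2>/dev/null
}
```

Usage would look something like "restore_bootcode /mnt/backup-host/bootsector.sda.dd /dev/sda", where both paths are placeholders. If the partition table itself was destroyed and the layout is identical to the backup, the whole sector can go back instead with "dd if=IMAGE of=DEVICE bs=512 count=1". Rehearse on a scratch file first; dd does not ask twice.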
Within the main-backup directory should be rsynced copies of the filesystem for this host as appearing from root, with all permissions. A usual backup will have entries such as: bin, etc, lib, media, mnt, opt, sbin, srv, var, boot, export, home, lib64, misc, net, root, selinux, usr
You should review fstab (/archive_logs/images-rsync/frogger/main-backup/etc/fstab in our example) to verify that your drive layouts are similar. Boot labels or boot devices should be the same so the restored boot-loader will function on the new system. While bricks were the foundation of our society, turning computers into them is not necessarily our goal.
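The fstab review can be done by eye, but a rough sketch of automating it follows. It reduces each fstab to its device, mount point, and filesystem type columns and diffs the result. The function names are invented for this example, and mount options are deliberately ignored on the assumption that only the layout matters at this step.

```shell
# fstab_layout FILE   (hypothetical helper)
# Prints "device mountpoint fstype" for each real entry in an fstab file,
# skipping comment lines and blank or short lines, sorted by mount point.
fstab_layout() {
    awk '$1 !~ /^#/ && NF >= 3 { print $1, $2, $3 }' "$1" | sort -k2,2
}

# compare_fstab BACKUP_FSTAB PREPARED_FSTAB   (hypothetical helper)
# Returns 0 when both files describe the same layout. Otherwise prints
# the diff of the two reduced layouts and returns diff's non-zero status.
compare_fstab() {
    left=$(mktemp) && right=$(mktemp) || return 2
    fstab_layout "$1" > "$left"
    fstab_layout "$2" > "$right"
    diff "$left" "$right"
    status=$?
    rm -f "$left" "$right"
    return "$status"
}
```

With the backup mounted on "/mnt/backup" as in the restore step, this would be run as "compare_fstab /mnt/backup/etc/fstab /etc/fstab" on the prepared system. Any output means the labels or devices differ and the restored boot loader deserves a second look before you reboot.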
Also, as they are backed-up daily, config files and changes can be reviewed on [machine name redacted] for most of our systems. Often [this machine] will be more current than the backup.
If the backup layout appears to sufficiently match the prepared system, we can move to the next step. Optional application of the "alcohol" toolset is appropriate here. As is cursing aloud at the "idiot" (variations: "asshole", "Tom-noddy muttonheaded juggins", "daft git", "customer") that killed this system if there is a direct party at fault. (Assignment of blame in blameless scenarios is covered later in this procedure and should not be addressed at this step.)
3. Restoring files
Boot the prepared base system from a rescue CD. (Preferably the OS install disk, otherwise Knoppix.) You will need to mount all partitions that should be restored and also bring up the network. Network settings can be reviewed either in the backup on ops or from the various config files stored in [change control machine].
If you can’t bring the system up on external media, this procedure can work from single-user mode but is slightly more dangerous. I highly recommend you invoke a statically linked shell (such as "ash") before starting your work.
If the system is in the XXX.XXX.0.0/16 network, then it should be able to directly mount the archive log partitions for rsync. If not, rsync can be done over SSH to root on ops. For an NFS mount it would be something like this:
%> mkdir -pv /mnt/backup
%> mount -t nfs XXX.XXX.XXX.XXX:/archive_logs/images-rsync/frogger/main-backup /mnt/backup
For each of the major directories in the backup, you want to rsync the contents of the backup onto the new system. The examples below assume the system being restored is mounted on "/mnt/restore" and the backup is on "/mnt/backup".
%> for path in /mnt/backup/*/;
do dir=$(basename "$path"); mkdir -pv "/mnt/restore/$dir"; rsync -av --delete "/mnt/backup/$dir/" "/mnt/restore/$dir";
done
For systems that cannot mount the NFS partition:
%> for dir in SPACE SEPARATED LIST OF DIRECTORIES IN BACKUP;
do mkdir -pv "/mnt/restore/$dir";
rsync -av --delete --rsh=ssh "root@[machine name]:/archive_logs/images-rsync/HOSTNAME/main-backup/$dir/" "/mnt/restore/$dir";
done
rsync should report on STDOUT the files it is currently transferring.
Note that it matters a great deal to rsync whether the source and destination directories have a slash at the end (for rsync it is the slash on the source that changes the behavior). Getting this wrong will rsync the data into subdirectories, mass-delete a lot of files that would otherwise just be stat-ed, transfer the full volume of files over the network (as opposed to a delta), and take correspondingly longer to process. This means you may potentially exhaust the alcohol and Doritos toolsets.
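To spell the slash business out: "rsync -a /mnt/backup/etc /mnt/restore/etc" creates "/mnt/restore/etc/etc", while "rsync -a /mnt/backup/etc/ /mnt/restore/etc" copies the contents into "/mnt/restore/etc". A small sketch that takes the guesswork out follows. Both function names are invented for this example, and the second one only prints the command so it can be read before it is run.

```shell
# with_trailing_slash PATH   (hypothetical helper)
# Prints PATH with exactly one trailing slash, so rsync copies the
# directory's contents rather than the directory itself.
with_trailing_slash() {
    path=$1
    # Strip any run of trailing slashes, then add a single one back.
    while [ "${path%/}" != "$path" ]; do
        path=${path%/}
    done
    printf '%s/\n' "$path"
}

# print_restore_cmd BACKUP_ROOT RESTORE_ROOT DIR   (hypothetical helper)
# Prints, without running it, the rsync command for one top-level
# directory, with both paths normalized to end in a single slash.
print_restore_cmd() {
    src=$(with_trailing_slash "$(with_trailing_slash "$1")$3")
    dst=$(with_trailing_slash "$(with_trailing_slash "$2")$3")
    printf 'rsync -av --delete %s %s\n' "$src" "$dst"
}
```

Running "print_restore_cmd /mnt/backup /mnt/restore etc" prints the command from the loop above with the slashes in the right places. Adding rsync's "-n" (dry run) flag to the printed command before the real run is a cheap way to see what "--delete" intends to remove.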
It is expected that this step will consume the most time; the toolset "video game system" is most appropriate here.
Also, the downtime of staring at rsync will give you time to consider where to assign the blame. If no specific person can be cited for blame it is customary that the person who has been off-call in operations the longest is the most deserving. This may be an appropriate resolution of blame even when there is a clear case for others to be at fault. Be aware that under certain circumstances involving higher level employees, recursive blame assignments may be triggered.
4. Bootstrapping the restored system
If the format of the underlying partitions has been unavoidably changed between backup and restoration you will need to reconfigure and re-install grub or LILO on the boot sector used by the system. If the partitions and versions line up, this should not be necessary. That being said, there is a backup copy of the boot sector that matches the backup that can be dd'ed back in place as a fallback.
Now it is time to reboot the system. Deprive the system of life-giving electricity until it expires, then resuscitate via restoration of power.
During the reboot, options are varied. Depending on your background and religious preference as a sysadmin, any of the following actions may be customary while rebooting:
- Curse loudly at the computer
- Use the alcohol toolset
- Perform the incantation of blame with the selected party from step 3
- Appropriate blood sacrifices
- Blame Microsoft
- Threaten system with retribution from the "engineering hammer" tool
- All of the above
Hopefully with the proper combination of actions and with the successful execution of previous recovery steps, the system should boot.
5. Check system health
Once a system is booted a thorough health check should be performed. Some of the steps may include:
- Check for hardware errors in logs and IPMI
- Verify that swap is active, all memory shows as present
- Perform an fsck of all partitions. It’s customary to take a shot on each bad inode.
- Check for any orphan files in lost+found
- Remove any inappropriate lost+found directories if file-system layout has changed.
- Verify network is present and routing
- Restart and verify individual apps
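A few of the checks above lend themselves to scripting. The sketch below covers swap, memory, and lost+found only, assuming a Linux system with the usual /proc files. The function names are invented for this example, and each one takes an optional file argument so it can be pointed at a saved copy.

```shell
# swap_active [SWAPS_FILE]   (hypothetical helper)
# Succeeds if at least one swap area is listed. /proc/swaps has a
# one-line header followed by one line per active swap area.
swap_active() {
    swaps=${1:-/proc/swaps}
    [ "$(awk 'NR > 1' "$swaps" | wc -l | tr -d ' ')" -gt 0 ]
}

# mem_total_kb [MEMINFO_FILE]   (hypothetical helper)
# Prints the MemTotal value in kB, for comparison with what the
# hardware is supposed to have.
mem_total_kb() {
    awk '$1 == "MemTotal:" { print $2 }' "${1:-/proc/meminfo}"
}

# lost_found_count DIR   (hypothetical helper)
# Prints the number of entries in a lost+found directory.
# Names containing newlines would be over-counted.
lost_found_count() {
    ls -A "$1" | wc -l | tr -d ' '
}
```

For example, "swap_active || echo 'no swap'" and "lost_found_count /home/lost+found" (the mount point is a placeholder). The shot-per-bad-inode accounting is left as a manual step.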
At this point the sysadmin will likely need to be rebooted. Some of the steps may include:
- Put away tools
- Go to the "Pink poodle"
- Play video games
- Sleep