Mar 27, 2010
What better way to celebrate summer solstice, than by making my computer able to hibernate? Since my last post a lot has happened with me. I got a new phone (HTC Hero FTW!), I finished university, went traveling a bit and I also got a new notebook (because the old one died on me). R.I.P. Thinkpad R51, welcome Thinkpad T500. There are several things I could start writing about now. Starting with how great Hero and Android is to use all the way to today's blog post: How to make my computer hibernate?
Linux has had support for hibernating for quite a few years now and although it's not perfect, it usually works out of the box. What it needs however is swap device big enough so that it can store image of memory for hibernating. Now I hit a problem. When I got my new Thinkpad I thought "Hey, I have 4GB of RAM...why would I need a swap?". And even if I REALLY needed more than 4GB RAM I can still create temporary swap by using swapfile. Unfortunately I couldn't make swapfile on LVM work with TuxOnIce. TuxOnIce also has another alternative to swap or swapfile for hibernating: Using filewriter, which is quite similar to swapfile support, I managed to get it to work (after some work, kernel debugging and one small patch to TuxOnIce).
I set FilewriterLocation in hibernate.conf to point to a place where I wanted to store hibernation file and I set the size to 4GB. As instructed in TuxOnIce HOWTO, I then ran
So what can we deduce from this bug output? Let's go from the top. Bug name (unable to handle kernel paging request) means that it is likely a memory corruption issue. Someone accessed memory that he was not supposed to. IP tells us that function where the error occurred was toi_attr_store in unknown file, unknown line (I don't have debug information included in kernel). There are other information we can get from that output, but I didn't really need them. Quick search through kernel sources told me that toi_attr_store is a function inside kernel/power/tuxonice_sysfs.c. I scanned the code, learning what approximately it did. Then I placed printk statements thorough the function so that I could approximate where inside the function the code fails. After some time I narrowed it down to following snippet:
Kernel crashed when it tried to call toi_free_page. After a few reboots and printks later I found out that this was just a coincidence, and sysfs_data variable itself became corrupt even before the call to the toi_free_page. Good candidate? Of course: write_side_effect. But what exactly was write_side_effect? This function was passed as an argument, and therefore I wasn't able to easily find out what was real code executed at this point. Time to find out! From my previous debugging attempts I knew code failed while it tried to write location of my resume file into /sys/power/tuxonince/file/target. TuxOnIce code defined handling for string sysfs arguments as such:
I found this macro used inside tuxonice_file.c source code like this:
So we found our write_side_effect code inside test_toi_file_target function. In one part this function was calling hex_dump_to_buffer to convert device UUID into hexadecimal string. The call looked like this:
This should convert input (fs_info->uuid) into hexadecimal string and store it inside buf. Author of the original code correctly thought about function adding spaces between bytes and therefore need to have more space in the buffer (argument 50 is telling hex_dump_to_buffer how big is output buffer). Unfortunately that same author declared buf as 33 char array. hex_dump_to_buffer therefore stepped outside the buffer and corrupted memory, causing all the problems. I fixed this bug, and sent a patch to the tuxonice-devel mailing list. As of now, it is already in the git repository ready to be released with next bugfix release of TuxOnIce.
That is everything for today, but as I already noted I am using LVM on my system (except root partition) and also use fbsplash for nice animations while rebooting. I am using initrd for this, and I will have another post on that topic.
Linux has had support for hibernating for quite a few years now and although it's not perfect, it usually works out of the box. What it needs however is swap device big enough so that it can store image of memory for hibernating. Now I hit a problem. When I got my new Thinkpad I thought "Hey, I have 4GB of RAM...why would I need a swap?". And even if I REALLY needed more than 4GB RAM I can still create temporary swap by using swapfile. Unfortunately I couldn't make swapfile on LVM work with TuxOnIce. TuxOnIce also has another alternative to swap or swapfile for hibernating: Using filewriter, which is quite similar to swapfile support, I managed to get it to work (after some work, kernel debugging and one small patch to TuxOnIce).
I set FilewriterLocation in hibernate.conf to point to a place where I wanted to store hibernation file and I set the size to 4GB. As instructed in TuxOnIce HOWTO, I then ran
hibernate --no-suspendto create this image. It created the file as expected, but when it was supposed to tell me settings for bootloader (resume argument) it silently failed. When I tried again, whole computer froze. I was puzzled. How could this happen? I am using Linux so things like this don't happen! But hey, I should be able to figure out what's wrong with it right? I set up my kernel to include netconsole, and ran hibernate again. This time I caught where the bug happened. The output was something like this:
TuxOnIce: No image found.
BUG: unable to handle kernel paging request at 6539207a
IP: [<c10763a6>] toi_attr_store+0x186/0x2a0
*pdpt = 0000000032732001 *pde = 0000000000000000
Oops: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/power/tuxonice/file/target
Modules linked in: netconsole aes_i586 aes_generic radeon ttm drm_kms_helper drm
i2c_algo_bit sco bnep ipt_MASQUERADE iptable_nat nf_nat ipt_LOG nf_conntrack_ip
v4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_t
ables x_tables rfcomm l2cap vboxnetadp vboxnetflt vboxdrv arc4 iwlagn iwlcore ma
c80211 sdhci_pci snd_hda_codec_conexant sdhci pcmcia e1000e uvcvideo mmc_core cf
g80211 snd_hda_intel yenta_socket btusb rsrc_nonstatic tpm_tis pcspkr pcmcia_cor
e videodev v4l1_compat intel_agp wmi agpgart tpm snd_hda_codec tpm_bios video fu
se xfs raid10 raid1 raid0 md_mod scsi_wait_scan sbp2 ohci1394 ieee1394 usbhid uh
ci_hcd usb_storage ehci_hcd usbcore sr_mod sg uvesafb cfbfillrect cfbimgblt cn c
fbcopyarea [last unloaded: microcode]
Pid: 12870, comm: hibernate Not tainted 2.6.33.1-w0rm #16 2082BRG/2082BRG
EIP: 0060:[<c10763a6>] EFLAGS: 00010202 CPU: 0
EIP is at toi_attr_store+0x186/0x2a0
EAX: 00000000 EBX: 36203430 ECX: 00000000 EDX: f231f200
ESI: 65392066 EDI: 00f60062 EBP: f6006331 ESP: f62a7f14
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process hibernate (pid: 12870, ti=f62a6000 task=f20a0270 task.ti=f62a6000)
Stack:
00000000 fffffff4 00000001 c1790ca0 00000000 f6e8ab64 c16c75a4 f6d1c380
<0> f62a7f64 c114298d 00000015 00000015 b7709000 f21385c0 f6d1c394 c16c75a4
<0> f6ec7ac0 f21385c0 b7709000 00000015 f62a7f8c c10f207c f62a7f98 00000000
Call Trace:
[<c114298d>] ? sysfs_write_file+0x9d/0x100
[<c10f207c>] ? vfs_write+0x9c/0x180
[<c11428f0>] ? sysfs_write_file+0x0/0x100
[<c10f221d>] ? sys_write+0x3d/0x70
[<c1002ccc>] ? sysenter_do_call+0x12/0x22
Code: c7 45 e0 00 00 00 00 3b 5d 08 0f 85 e9 fe ff ff 8b 46 20 85 c0 0f 84 de fe
ff ff ff d0 8b 7d e0 85 ff 8d 76 00 0f 84 d9 fe ff ff <8b> 46 14 31 d2 e8 60 03
05 00 8b 46 10 c7 46 14 00 00 00 00 a8
EIP: [<c10763a6>] toi_attr_store+0x186/0x2a0 SS:ESP 0068:f62a7f14
CR2: 000000006539207a
---[ end trace 124a5ee29ef71277 ]---
So what can we deduce from this bug output? Let's go from the top. Bug name (unable to handle kernel paging request) means that it is likely a memory corruption issue. Someone accessed memory that he was not supposed to. IP tells us that function where the error occurred was toi_attr_store in unknown file, unknown line (I don't have debug information included in kernel). There are other information we can get from that output, but I didn't really need them. Quick search through kernel sources told me that toi_attr_store is a function inside kernel/power/tuxonice_sysfs.c. I scanned the code, learning what approximately it did. Then I placed printk statements thorough the function so that I could approximate where inside the function the code fails. After some time I narrowed it down to following snippet:
if (!result)
result = count;
/* Side effect routine? */
if (result == count && sysfs_data->write_side_effect)
sysfs_data->write_side_effect();
/* Free temporary buffers */
if (assigned_temp_buffer) {
toi_free_page(31,
(unsigned long) sysfs_data->data.string.variable);
sysfs_data->data.string.variable = NULL;
}
Kernel crashed when it tried to call toi_free_page. After a few reboots and printks later I found out that this was just a coincidence, and sysfs_data variable itself became corrupt even before the call to the toi_free_page. Good candidate? Of course: write_side_effect. But what exactly was write_side_effect? This function was passed as an argument, and therefore I wasn't able to easily find out what was real code executed at this point. Time to find out! From my previous debugging attempts I knew code failed while it tried to write location of my resume file into /sys/power/tuxonince/file/target. TuxOnIce code defined handling for string sysfs arguments as such:
#define SYSFS_STRING(_name, _mode, _string, _max_len, _flags, _wse) { \
.attr = {.name = _name , .mode = _mode }, \
.type = TOI_SYSFS_DATA_STRING, \
.flags = _flags, \
.data = { .string = { .variable = _string, .max_length = _max_len } }, \
.write_side_effect = _wse }
I found this macro used inside tuxonice_file.c source code like this:
SYSFS_STRING("target", SYSFS_RW, toi_file_target, 256,
SYSFS_NEEDS_SM_FOR_WRITE, test_toi_file_target)
So we found our write_side_effect code inside test_toi_file_target function. In one part this function was calling hex_dump_to_buffer to convert device UUID into hexadecimal string. The call looked like this:
hex_dump_to_buffer(fs_info->uuid, 16, 32, 1, buf, 50, 0);
This should convert input (fs_info->uuid) into hexadecimal string and store it inside buf. Author of the original code correctly thought about function adding spaces between bytes and therefore need to have more space in the buffer (argument 50 is telling hex_dump_to_buffer how big is output buffer). Unfortunately that same author declared buf as 33 char array. hex_dump_to_buffer therefore stepped outside the buffer and corrupted memory, causing all the problems. I fixed this bug, and sent a patch to the tuxonice-devel mailing list. As of now, it is already in the git repository ready to be released with next bugfix release of TuxOnIce.
That is everything for today, but as I already noted I am using LVM on my system (except root partition) and also use fbsplash for nice animations while rebooting. I am using initrd for this, and I will have another post on that topic.
geeky as always :D