Aug
31
2006
0

Choreography and treadmills

Four people, eight treadmills, a video camera, and creative thinking:

Bravo. The group OK Go, can be found at okgo.net.

Written by kunau in: general interest
Aug
31
2006
0

HAPMAP Linkage disequilibrium, in MySQL

I’m trying to upload HAPMAP LD data into MySQL. The table is named CEU after the population group. Here is the description:


+------------+-------------+------+-----+---------+-------+
| Field      | Type        | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+-------+
| chromosome | varchar(2)  |      |     | 0       |       |
| marker1    | int(11)     |      |     | 0       |       |
| marker2    | int(11)     |      |     | 0       |       |
| population | varchar(5)  |      | PRI |         |       |
| rsmarker1  | varchar(12) |      | PRI |         |       |
| rsmarker2  | varchar(12) |      | PRI |         |       |
| dprime     | double      |      |     | 0       |       |
| rsquare    | double      |      |     | 0       |       |
| LOD        | double      |      |     | 0       |       |
| fbin       | int(11)     |      |     | 0       |       |
+------------+-------------+------+-----+---------+-------+

One file for each of twenty-two chromosomes, X and Y are not needed for this exercise. Chromosome one holds nearly 50 million rows of data in a 2.5GB file. This collection of files creates such a large and onerous footprint that there are very few desktop machines able to decompress the files, let alone manipulate them in some useful and timely way. You are invited to try this at home. (Data available here.) Problems like these call for appropriate machines and databases. Our database server is an eight processor, 32GB/RAM Sun Microsystems v880 with a fiber-channel attached 4.5TB Sun StorEdge array. It should be up to the task. But is the database?

The database trick will be to complete the import without blowing off the end of MySQL table limits. According to documentation there is a 4.2B row limit on MYISAM table types. But there is a 4GB table size limit I think we’ll run into first. Table status currently reads:


mysql> show table status like 'ceu' \G
*************************** 1. row ***************************
           Name: ceu
         Engine: MyISAM
        Version: 9
     Row_format: Dynamic
           Rows: 11021586
 Avg_row_length: 70
    Data_length: 776406388
Max_data_length: 4294967295
   Index_length: 124839936
      Data_free: 0
 Auto_increment: NULL
    Create_time: 2006-08-31 15:26:56
    Update_time: 2006-08-31 21:07:54
     Check_time: NULL
      Collation: utf8_general_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

One possible fix might be to alter the table with the following command:


mysql> alter table ceu max_rows = 200000000000 avg_row_length = 60 ;
Query OK, 11192144 rows affected (24 min 40.44 sec)
Records: 11192144  Duplicates: 0  Warnings: 0

mysql> show table status like 'ceu' \G
*************************** 1. row ***************************
           Name: ceu
         Engine: MyISAM
        Version: 9
     Row_format: Dynamic
           Rows: 11249635
 Avg_row_length: 70
    Data_length: 792599672
Max_data_length: 1099511627775
   Index_length: 139523072
      Data_free: 0
 Auto_increment: NULL
    Create_time: 2006-08-31 21:11:00
    Update_time: 2006-08-31 21:36:50
     Check_time: NULL
      Collation: utf8_general_ci
       Checksum: NULL
 Create_options: max_rows=2431504384 avg_row_length=60
        Comment:
1 row in set (0.01 sec)

This done I’m letting the import run over night. It may take some tweaking to get right. Once the data is loaded we’ll need to build indexes to speed the query process. I can already think of a number of things I would do differently if this were loaded into Oracle. I may end up in Oracle before this exercise is over. Regardless, it is an interesting test of databases at their limits. So far:

Turns out I don’t think I have enough room for all this data on the current partition. If I extrapolate from the currently installed data I will run out some time tomorrow afternoon.

New space has been found and I’m trying again. If this database gets larger than half a terabyte we’ll have to rethink using MySQL as the partition for MySQL is limited.

Here are the chromosomes currently or partially loaded:


mysql> select distinct(chromosome) from ceu order by chromosome;
+------------+
| chromosome |
+------------+
|          1 |
|          2 |
|          3 |
|          4 |
|          5 |
|          6 |
|          8 |
|          9 |
|         10 |
|         11 |
|         12 |
|         13 |
|         14 |
|         15 |
|         16 |
|         17 |
|         18 |
|         19 |
|         20 |
|         21 |
|         22 |
+------------+
21 rows in set (1 hour 15 min 39.11 sec)

Only one more chromosome to go. These imports should complete some time Tuesday.

Finished. This oversized table is complete:


mysql> select count(*) from ceu;
+-----------+
| count(*)  |
+-----------+
| 696238513 |
+-----------+
1 row in set (0.01 sec)

Queries may look something like this:


mysql>  SELECT count(*) FROM suzanne.ceu c where ( c.rsquare > 0.8 and
	c.rsmarker1 = "rs10473282" or c.rsmarker2 = "rs10473282" );
+----------+
| count(*) |
+----------+
|      179 |
+----------+
1 row in set (1 hour 12 min 8.17 sec)

In an effort to speed things up a bit, I’ve added:


ALTER TABLE `suzanne`.`ceu` ADD INDEX `rsquare`(`rsquare`),
 ADD INDEX `dprime`(`dprime`);

It will take some time to build these indexes and then we’ll run the query again. Reading about other methods of modifying the MySQL database engine to bette accommodate data-sets of this size.

A natural partition of this data might be along the chromosome divide but that doesn’t support the type of questions we need to ask. The real utility of a database like this will be having all the parts in one place.

Where are X and Y? Not sure of the biological reasons for their omission.

Written by kunau in: databases
Aug
29
2006
0

Open Suse 10.1 under Parallels Desktop for Macintosh

I downloaded the DVD image via torrent:
suse.png

SUSE-Linux-10.1-DVD-i386-10.1

I verified the checksum of the disc.

When I boot and run the installer in a newly created Parallels VM, the install progresses normally until I get to the installer screen that lets you configure the software installed. The “Overview” tab shows “Partitioning,” “Software” and “Language”. But the Software section says “No catalog found at ‘dvd:///?devices%3d%2fdev%2fhdb’”. The same under “Software” section in the “Expert” tab. The Software section shows no content.

This was corrected by aborting from Yast setup screen to top the installation and bringing up the text console. From the text console in the middle of the screen, select load modules, and use this function to load the UDF module.

Picture 1.png Picture 2.png
Picture 3.png Picture 4.png
Picture 5.png

Then navigate back through the text menus and restart the install. This corrects the software catalog problem, the install appears to continue as expected until the ‘Finishing Basic installation’ stage where this error is produced ‘An error occurred during initrd creation’. Has anyone seen this?:

Picture 6.png

As Kevin Rose from diggnation says, ‘ok, moving on’.

Written by kunau in: LINUX, Macintosh, tools
Aug
25
2006
0

Get your kicks on all 66!

All sixty-six cpus are working at once! Over 20,000 jobs have been submitted by Gridstuffer and processed by the Xgrid system. A true distributed supercomputer.

Xgrid-all-66.png

Aug
24
2006
0

XDS affinity domain project

We have begun working with Kevin Peterson and his ePCRN project.

XDS is a protocol for Cross-enterprise Document Sharing. The XDS profile defined by Integrating the Healthcare Enterprise (IHE) defines a standards based method for managing the sharing of documents among healthcare service providers. The XDS profile defines a set of actors and transactions which allow documents to be registered in a central registry, queried and retrieved.

An XDS Affinity Domain is a group of healthcare service providers that have agreed to work together using a common set of policies and infrastructure. An XDS Affinity Domain may be a Regional Healthcare Information Organization (RHIO) that defines a group of enterprises which will have a protocol for sharing medical data. Though an RHIO is regionally located, an XDS Affinity Domain does not require regional proximity.

An XDS Affinity Domain requires a protocol for sharing medical data. This XDS profile is document based and includes a single Document Registry containing an entry for every registered document. The entry contains a set of metadata elements. Configuring an XDS Affinity Domain requires first identifying the location of the Document Registry. Second, defining policies decisions agreeable to all participating service providers. Policies include document format, values for metadata elements of document entries, coding systems and terminology.

We are interested in building an Open Source XDS server with the following components:

patient ID server
doc ID server
audit repository
data repository

I’ve attended several telephone and video conferences on the subject and spoken with a researcher at IBM Almaden Research Center. We are in the process of specifying just what it would take to build such a service.

Written by kunau in: databases, tools

Powered by WordPress. 13 queries in 2.849 seconds.