Efficient Coding of Contact Matrices

// Data Transmission // Information and Communication Technology
Ref-Nr: 17209

Abstract

The representation and coding of genomic annotations are currently being standardised by ISO/IEC JTC 1/SC 29/WG 8 (MPEG-G). The present invention was introduced into the standardisation process and relates to a method for the efficient coding of contact matrices.

Fig. 1: File size of Hi-C data compressed using Cooler, LZMA, Bzip2 and our approach. With multi-resolution coding and on-the-fly normalization, we can further reduce the file size by up to a factor of 5.

Background

The invention concerns the coding of contact matrices. Chromosome coordinates describe the start and end positions of a gene (or other genetic element) on a chromosome. A pair of chromosome coordinates that are close to each other is called a contact. A contact matrix can be represented as a sparse matrix. The matrix contains not only the number of contacts within a given genomic region, but also the corresponding normalised value. The invention is an extension of the invention “Method for the Coding of Contact Matrix” (German patent application pending; technology description: https://www.ezn.de/ezn-patent/method-for-the-coding-of-contact-matrix/).
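To illustrate the sparse representation described above, the following minimal sketch bins a handful of contacts into a dictionary that stores only non-zero cells. It is not the coding method of the invention. The function name `bin_contacts`, the sample coordinates and the 10,000 bp bin size are assumptions chosen for illustration, as is the choice to store only the upper triangle (which assumes a symmetric, single-chromosome matrix).

```python
from collections import Counter


def bin_contacts(contacts, bin_interval):
    """Count contacts per (row_bin, col_bin) cell, keeping only non-zero cells."""
    counts = Counter()
    for pos_a, pos_b in contacts:
        i = pos_a // bin_interval
        j = pos_b // bin_interval
        # Assumption: the matrix is symmetric, so only the upper triangle is kept.
        counts[(min(i, j), max(i, j))] += 1
    return dict(counts)


# Hypothetical contacts: pairs of positions (in base pairs) on one chromosome.
contacts = [(1_200, 48_000), (3_500, 47_100), (52_000, 53_900), (49_000, 2_000)]
sparse = bin_contacts(contacts, bin_interval=10_000)
print(sparse)  # {(0, 4): 3, (5, 5): 1}
```

Only two of the 36 possible cells of this 6 x 6 matrix are stored, which is why a sparse layout suits contact data.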

Innovation / Solution

According to the invention, a highly efficient coding of contact matrices and new functionalities are proposed. The extended structure contains new elements such as a contact matrix header and a bin payload. The header contains information such as the bin interval of the contact matrix tiles, the list of chromosomes with their names and lengths, the sample names, and the names of the normalization methods applied to the contact matrix tiles. A normalization method may be either computed on the fly or precomputed. The bin payload contains the interval multiplier, which is needed in the multi-interval case, where the weights correspond to the higher interval. The weights for each on-the-fly normalization are also stored in the bin payload. Because the weights do not require much space, no compression is necessary. Additionally, the term "interval" is used instead of "resolution" to avoid confusion: a higher resolution normally means finer detail, whereas a contact matrix at a higher interval is less detailed.
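The header and bin payload described above can be sketched as two small data structures, together with one possible way of deriving a coarser interval and applying stored weights on the fly. This is a sketch under stated assumptions, not the format defined by the invention or by MPEG-G. All class, field and function names are hypothetical. The formula `count * w[i] * w[j]` is a common convention for weight-based normalization of contact matrices and is assumed here, since the text does not state the formula.

```python
from dataclasses import dataclass


@dataclass
class ContactMatrixHeader:
    bin_interval: int   # base bin size in base pairs
    chromosomes: list   # (name, length) pairs
    sample_names: list
    norm_methods: list  # names of the normalization methods


@dataclass
class BinPayload:
    interval_multiplier: int  # coarser interval = bin_interval * multiplier
    weights: dict             # method name -> per-bin weights at the coarser interval


def coarsen(counts, multiplier):
    """Merge sparse cells into bins that are `multiplier` times larger."""
    merged = {}
    for (i, j), count in counts.items():
        key = (i // multiplier, j // multiplier)
        merged[key] = merged.get(key, 0) + count
    return merged


def normalize_on_the_fly(counts, weights):
    """Assumed formula: normalized = count * w[i] * w[j]; nothing is stored per cell."""
    return {(i, j): c * weights[i] * weights[j] for (i, j), c in counts.items()}


# Hypothetical example: one 60,000 bp chromosome, 10,000 bp bins, multiplier 2.
header = ContactMatrixHeader(10_000, [("chr1", 60_000)], ["sample_A"], ["example_norm"])
payload = BinPayload(2, {"example_norm": [0.5, 1.0, 2.0]})

counts = {(0, 4): 3, (1, 4): 2, (5, 5): 1}
coarse = coarsen(counts, payload.interval_multiplier)
normalized = normalize_on_the_fly(coarse, payload.weights["example_norm"])
print(coarse)      # {(0, 2): 5, (2, 2): 1}
print(normalized)  # {(0, 2): 5.0, (2, 2): 4.0}
```

In this sketch only one weight per bin is kept, rather than one normalized value per matrix cell, which reflects the text's point that the weights take little space.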

Benefits

Better structure to support multiple samples
Support for on-the-fly computation of normalized values, for efficient storage of normalized contact matrices
Technique for multi-interval coding

Fields of Application

Coding of genomic data
