[ExI] 1 mm^3 of brain

BillK pharos at gmail.com
Mon May 20 16:11:40 UTC 2024


On Mon, 20 May 2024 at 15:26, Will Steinberg via extropy-chat <
extropy-chat at lists.extropy.org> wrote:
>
> @john: the human genome IS compressed data.


Hi Will

Genetic researchers can compress the human genome.
BillK

I asked -
Is the human genome already compressed data, or can it be stored in a
compressed format?

Perplexity Pro AI -

The human genome is not inherently compressed data, but it can be stored in
a compressed format. The vast amount of genomic data generated by modern
sequencing technologies necessitates efficient storage solutions, and
various specialized compression algorithms have been developed to address
this need.
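
To make the "can be stored in a compressed format" point concrete, here is a minimal sketch of the simplest possible scheme: packing each base into 2 bits instead of one 8-bit character. This is not any of the tools listed below, and the names `pack` and `unpack` are made up for illustration. It also assumes the sequence contains only A, C, G and T (real data also has N and other symbols, which this ignores).

```python
# Illustrative 2-bit packing of DNA; hypothetical helper names, not a cited tool.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so every byte unpacks the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Reverse pack(); n_bases is needed to drop padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n_bases])


if __name__ == "__main__":
    seq = "ACGTAC"
    packed = pack(seq)
    print(len(seq), "bases ->", len(packed), "bytes")
    print(unpack(packed, len(seq)) == seq)
```

At 2 bits per base, a sequence of roughly 3 billion bases would take about 750 MB rather than about 3 GB as plain text. The specialized methods below aim to do better than this fixed 4:1 by exploiting repeats and statistical structure.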

### Compression Methods for Genomic Data

1. **SparkGC**:
   - SparkGC is an Apache Spark-based genome compression method designed to
handle large collections of genomes efficiently. It uses Resilient
Distributed Datasets (RDD) transformations to keep data active in memory
throughout the compression process, significantly improving compression
speed and scalability. SparkGC can compress 1100 human genomes on a common
computer with just 24 GB of RAM, achieving the best compression ratio among
state-of-the-art methods[1].

2. **Lossless DNA Compression Algorithm**:
   - This method specializes in compressing nucleotide sequence data in
FASTA/multi-FASTA formats. It separates the headers and uses 7-Zip for
compression, achieving a high compression ratio by leveraging the specific
characteristics of DNA sequences[2].

3. **LZ77 and CTW Algorithms**:
   - A method combining the LZ77 algorithm with the context tree weighting
(CTW) algorithm has been proposed to compress DNA data. This approach first
converts DNA sequences into intermediate files before applying LZ77,
resulting in significant reductions in compression and decompression times
while maintaining a high compression rate[3].

4. **Neural Network-Based Compression (GeCo3)**:
   - GeCo3 uses a neural network metamodel to improve the compression ratio
of DNA sequences. It benchmarks both reference-free and referential
compression, showing consistent improvements over other state-of-the-art
DNA compressors. This method is particularly useful for long-term storage
and sequence analysis[4].

5. **FQSqueezer**:
   - FQSqueezer is a novel algorithm for compressing sequencing data,
particularly FASTQ files. It uses prediction by partial matching (PPM) and
dynamic Markov coder (DMC) algorithms, achieving better compression ratios
than existing tools. However, it requires significant memory and time
resources[5].
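
As a rough illustration of the idea in item 2 (separate the FASTA headers from the sequence data, then hand each stream to a general-purpose compressor), here is a minimal sketch. It is not the cited paper's implementation. Python's standard `lzma` module stands in for 7-Zip, on the assumption that LZMA (7-Zip's default method) is a fair substitute, and the function names are hypothetical.

```python
import lzma

# Sketch only: hypothetical helpers, with lzma standing in for 7-Zip.


def split_fasta(text: str):
    """Return (headers, sequences) from FASTA text, joining wrapped lines."""
    headers, seqs = [], []
    for line in text.splitlines():
        if line.startswith(">"):
            headers.append(line[1:])
            seqs.append([])
        elif line.strip() and seqs:
            seqs[-1].append(line.strip())
    return headers, ["".join(parts) for parts in seqs]


def compress_fasta(text: str):
    """Compress headers and sequences as two separate streams."""
    headers, seqs = split_fasta(text)
    header_blob = lzma.compress("\n".join(headers).encode("ascii"))
    seq_blob = lzma.compress("\n".join(seqs).encode("ascii"))
    return header_blob, seq_blob


def decompress_fasta(header_blob: bytes, seq_blob: bytes):
    """Recover (header, sequence) pairs; original line wrapping is not kept."""
    headers = lzma.decompress(header_blob).decode("ascii").split("\n")
    seqs = lzma.decompress(seq_blob).decode("ascii").split("\n")
    return list(zip(headers, seqs))


if __name__ == "__main__":
    fasta = ">seq1 test\nACGT\nACGT\n>seq2\nGGGG\n"
    blobs = compress_fasta(fasta)
    print(decompress_fasta(*blobs))
```

The point of splitting is that header text and nucleotide text have very different statistics, so each stream tends to compress better on its own than when interleaved.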

### Conclusion

The human genome can indeed be stored in a compressed format using
specialized algorithms designed to handle the unique characteristics of
genomic data. These methods significantly reduce the storage requirements
and improve the efficiency of data handling in genomic research and
applications.

Citations:
[1] https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04825-5
[2] https://www.mdpi.com/1999-4893/13/4/99
[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7688149/
[4] https://academic.oup.com/gigascience/article/9/11/giaa119/5974977
[5] https://www.nature.com/articles/s41598-020-57452-6
[6] https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05500-z
[7] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9337911/
[8] https://www.reddit.com/r/askscience/comments/637gyu/is_dna_compressed/
[9] https://arxiv.org/abs/2006.02232
[10] https://www.sciencedirect.com/science/article/pii/S2666389922001738
[11] https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2023.1260531/full
[12] https://en.wikipedia.org/wiki/Compression_of_genomic_sequencing_data
[13] https://github.com/rajatdiptabiswas/dna-compression
[14] https://arxiv.org/abs/2401.14025
[15] https://www.sciencedirect.com/science/article/pii/S1877050916318762
[16] https://www.illumina.com/documents/products/whitepapers/whitepaper_datacompression.pdf
[17] https://arxiv.org/abs/2311.01431
[18] https://blog.chiariglione.org/genome-is-digital-and-can-be-compressed/
[19] https://pubmed.ncbi.nlm.nih.gov/37018338/
[20] https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02443-7