<div dir="ltr">On Mon, 20 May 2024 at 15:26, Will Steinberg via extropy-chat &lt;<a href="mailto:extropy-chat@lists.extropy.org">extropy-chat@lists.extropy.org</a>&gt; wrote:<br>&gt;<br>&gt; @john: the human genome IS compressed data.<br>&gt; _______________________________________________<br><br><br>Hi Will<br><br>Genetic researchers can compress the human genome.<br>BillK<br><br>I asked -<br>Is the human genome already compressed data, or can it be stored in a compressed format?<br><div><br></div><div>Perplexity Pro AI -</div><div><br></div><div>The human genome is not inherently compressed data, but it can be stored in a compressed format. The vast amount of genomic data generated by modern sequencing technologies necessitates efficient storage solutions, and various specialized compression algorithms have been developed to address this need.<br><br>### Compression Methods for Genomic Data<br><br>1. **SparkGC**:<br>   - SparkGC is an Apache Spark-based genome compression method designed to handle large collections of genomes efficiently. It uses Resilient Distributed Datasets (RDD) transformations to keep data active in memory throughout the compression process, significantly improving compression speed and scalability. SparkGC can compress 1100 human genomes on a common computer with just 24 GB of RAM, achieving the best compression ratio among state-of-the-art methods[1].<br><br>2. **Lossless DNA Compression Algorithm**:<br>   - This method specializes in compressing nucleotide sequence data in FASTA/multi-FASTA formats. It separates the headers and uses 7-Zip for compression, achieving a high compression ratio by leveraging the specific characteristics of DNA sequences[2].<br><br>3. **LZ77 and CTW Algorithms**:<br>   - A method combining the LZ77 algorithm with the context tree weighting (CTW) algorithm has been proposed to compress DNA data. 
This approach first converts DNA sequences into intermediate files before applying LZ77, which significantly reduces compression and decompression times while maintaining a high compression rate[3].<br><br>4. **Neural Network-Based Compression (GeCo3)**:<br>   - GeCo3 uses a neural network metamodel to improve the compression ratio of DNA sequences. It was benchmarked on both reference-free and referential compression, showing consistent improvements over other state-of-the-art DNA compressors. This method is particularly useful for long-term storage and sequence analysis[4].<br><br>5. **FQSqueezer**:<br>   - FQSqueezer is a novel algorithm for compressing sequencing data, particularly FASTQ files. It builds on the prediction by partial matching (PPM) and dynamic Markov coder (DMC) algorithms, achieving better compression ratios than existing tools. However, it requires significant memory and time[5].<br><br>### Conclusion<br><br>The human genome can indeed be stored in a compressed format using specialized algorithms designed to handle the unique characteristics of genomic data. 
These methods significantly reduce the storage requirements and improve the efficiency of data handling in genomic research and applications.<br><br>Citations:<br>[1] <a href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04825-5">https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04825-5</a><br>[2] <a href="https://www.mdpi.com/1999-4893/13/4/99">https://www.mdpi.com/1999-4893/13/4/99</a><br>[3] <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7688149/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7688149/</a><br>[4] <a href="https://academic.oup.com/gigascience/article/9/11/giaa119/5974977">https://academic.oup.com/gigascience/article/9/11/giaa119/5974977</a><br>[5] <a href="https://www.nature.com/articles/s41598-020-57452-6">https://www.nature.com/articles/s41598-020-57452-6</a><br>[6] <a href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05500-z">https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05500-z</a><br>[7] <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9337911/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9337911/</a><br>[8] <a href="https://www.reddit.com/r/askscience/comments/637gyu/is_dna_compressed/">https://www.reddit.com/r/askscience/comments/637gyu/is_dna_compressed/</a><br>[9] <a href="https://arxiv.org/abs/2006.02232">https://arxiv.org/abs/2006.02232</a><br>[10] <a href="https://www.sciencedirect.com/science/article/pii/S2666389922001738">https://www.sciencedirect.com/science/article/pii/S2666389922001738</a><br>[11] <a href="https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2023.1260531/full">https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2023.1260531/full</a><br>[12] <a href="https://en.wikipedia.org/wiki/Compression_of_genomic_sequencing_data">https://en.wikipedia.org/wiki/Compression_of_genomic_sequencing_data</a><br>[13] <a 
href="https://github.com/rajatdiptabiswas/dna-compression">https://github.com/rajatdiptabiswas/dna-compression</a><br>[14] <a href="https://arxiv.org/abs/2401.14025">https://arxiv.org/abs/2401.14025</a><br>[15] <a href="https://www.sciencedirect.com/science/article/pii/S1877050916318762">https://www.sciencedirect.com/science/article/pii/S1877050916318762</a><br>[16] <a href="https://www.illumina.com/documents/products/whitepapers/whitepaper_datacompression.pdf">https://www.illumina.com/documents/products/whitepapers/whitepaper_datacompression.pdf</a><br>[17] <a href="https://arxiv.org/abs/2311.01431">https://arxiv.org/abs/2311.01431</a><br>[18] <a href="https://blog.chiariglione.org/genome-is-digital-and-can-be-compressed/">https://blog.chiariglione.org/genome-is-digital-and-can-be-compressed/</a><br>[19] <a href="https://pubmed.ncbi.nlm.nih.gov/37018338/">https://pubmed.ncbi.nlm.nih.gov/37018338/</a><br>[20] <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02443-7">https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02443-7</a></div>------------------------------------<br></div>
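<div><br>As a rough illustration of the header-separation idea in item 2 above, here is a minimal Python sketch. It is not the cited paper's implementation: the function names are hypothetical, and Python's standard-library lzma module (the LZMA family that 7-Zip's 7z format uses) stands in for 7-Zip itself. It assumes plain FASTA text and does not preserve the original line wrapping.<br><br></div>
<pre>
```python
import lzma


def split_fasta(text):
    """Split FASTA text into parallel lists of headers and sequences."""
    headers, seqs = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            headers.append(line[1:])   # drop the leading '>'
            seqs.append([])            # start a new record
        elif seqs:
            seqs[-1].append(line)      # sequence may span several lines
    return headers, ["".join(parts) for parts in seqs]


def compress_fasta(text):
    """Compress headers and sequences as two separate LZMA streams."""
    headers, seqs = split_fasta(text)
    header_blob = lzma.compress("\n".join(headers).encode("utf-8"))
    seq_blob = lzma.compress("\n".join(seqs).encode("utf-8"))
    return header_blob, seq_blob


def decompress_fasta(header_blob, seq_blob):
    """Rebuild (header, sequence) pairs from the two streams."""
    headers = lzma.decompress(header_blob).decode("utf-8").split("\n")
    seqs = lzma.decompress(seq_blob).decode("utf-8").split("\n")
    return list(zip(headers, seqs))


example = ">seq1 demo\nACGTACGT\nACGT\n>seq2\nGGGTTTAA\n"
header_blob, seq_blob = compress_fasta(example)
records = decompress_fasta(header_blob, seq_blob)
```
</pre>
<div>Keeping the two streams apart means the compressor sees free-form header text and the four-letter nucleotide alphabet separately, which is the general motivation for this kind of split. On an input as tiny as the example, the LZMA container overhead outweighs any saving, so the sketch only demonstrates the lossless round trip of headers and sequences.<br></div>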