What kinds of files lead to lots of compression?
- The types of files that lead to lots of compression are LARGER, more
REDUNDANT .txt files. Files such as calgary and BooksANDHTML are both
rather large collections of text that compressed about 41% on average.
Interestingly, the largest text file I compressed, ciaFactbook2008,
only compressed about 23%. After quickly skimming through the text file,
I believe this is due to the file containing many statistics and numbers
instead of just words, like the former two files. This causes the
respective HuffmanTree to have few shallow leaf nodes, or more simply
put, few short encodings. As a result, the compressed file is not much
smaller, which shows why REDUNDANT .txt files are better suited for
lots of compression. The sketch below illustrates the idea.
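Here is a minimal, self-contained Huffman sketch in Java (hypothetical code, not the repo's actual HuffmanTree): given a skewed frequency table like the one a REDUNDANT text file produces, the most common symbol ends up with a 1-bit encoding.

```java
import java.util.*;

public class HuffmanDemo {
    // Minimal Huffman tree node; assumes nothing about the repo's classes.
    static class Node implements Comparable<Node> {
        int weight; char symbol; Node left, right;
        Node(int w, char s) { weight = w; symbol = s; }
        Node(Node l, Node r) { weight = l.weight + r.weight; left = l; right = r; }
        boolean isLeaf() { return left == null; }
        public int compareTo(Node o) { return Integer.compare(weight, o.weight); }
    }

    // Walk the tree, collecting the 0/1 path to each leaf as its encoding.
    static void codes(Node n, String prefix, Map<Character, String> out) {
        if (n.isLeaf()) { out.put(n.symbol, prefix.isEmpty() ? "0" : prefix); return; }
        codes(n.left, prefix + "0", out);
        codes(n.right, prefix + "1", out);
    }

    public static void main(String[] args) {
        // Skewed counts, like a redundant text file: 'e' dominates.
        Map<Character, Integer> freq = Map.of('e', 120, 't', 60, 'a', 30, 'x', 5, 'q', 1);
        PriorityQueue<Node> pq = new PriorityQueue<>();
        freq.forEach((c, w) -> pq.add(new Node(w, c)));
        while (pq.size() > 1) pq.add(new Node(pq.poll(), pq.poll()));
        Map<Character, String> table = new TreeMap<>();
        codes(pq.poll(), "", table);
        table.forEach((c, code) -> System.out.println(c + " -> " + code));
        // Output (sorted): a -> 001, e -> 1, q -> 0000, t -> 01, x -> 0001.
        // The dominant 'e' costs 1 bit; the rare 'q' and 'x' cost 4 bits.
    }
}
```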
What kinds of files had little or no compression?
- The types of files that lead to little or no compression are SMALLER
.txt files and UNIQUE (essentially random) .tif image files. Files such
as 'smallTxt.txt' and the waterloo directory, which contains .tif image
files, either compressed negatively, meaning the resulting compressed
file was actually LARGER than the uncompressed file, or compressed only
marginally. The 'smallTxt.txt.hf' file is LARGER because the header
written into the output file is by itself much LARGER than the 26-byte
uncompressed file (see the sketch after this answer). For waterloo, I
believe the image files compress so marginally because their bytes look
nearly random: no byte value is much more frequent than any other, so
the HuffmanTree ends up nearly balanced, with no leaf nodes carrying
encodings much shorter than the 8 bits each byte already uses, resulting
in marginal compression.
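A back-of-the-envelope sketch of why a 26-byte input balloons to 1044 bytes (the .hf header layout below is an assumption, not the repo's documented format): if the header stores a 256-entry frequency table of 4-byte counts, that table alone is 1 KiB.

```java
public class HeaderOverhead {
    public static void main(String[] args) {
        // ASSUMPTION: the .hf header holds a 256-entry table of 4-byte
        // counts; the remaining ~20 bytes (magic number, encoded body,
        // pseudo-EOF) are a rough guess that matches the observed 1044.
        int headerBytes = 256 * Integer.BYTES;       // 1024
        int otherBytes  = 20;                        // rough guess
        int inputBytes  = 26;                        // smallTxt.txt
        int outputBytes = headerBytes + otherBytes;  // 1044
        double pct = 100.0 * (inputBytes - outputBytes) / inputBytes;
        System.out.printf("%.3f%%%n", pct);          // -3915.385
    }
}
```

No matter how well the body codes, a fixed header of this size dwarfs any tiny input, which is why small files always compress negatively.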
What happens when you try to compress a Huffman code file?
- When you compress a Huffman code file, the total percent compression
is very small. This feels quite unintuitive at first, but an example
helps make the idea clear. For example, 'ciaFactbook2008.txt' compresses
about 23%. If we compress the resulting 'ciaFactbook2008.txt.hf', what
exactly is being compressed? It is the stream of '1's and '0's used to
encode the REDUNDANT characters of the original 'ciaFactbook2008.txt'
file. Huffman coding has already squeezed out that redundancy, so the
output bits look essentially UNIQUE (random). Consequently, compressing
this output file should NOT result in a lot of compression, which is
quite evident as the 'ciaFactbook2008.txt.hf' file only compressed
about 1%... LULW. The entropy check below makes this concrete.
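As a concrete check, this sketch (the file path argument is a placeholder) computes the Shannon entropy of a file's byte histogram, which is a lower bound on the bits per byte any Huffman pass can achieve. A plain .txt file scores well under 8; an .hf file sits near 8.0, so a second pass has almost nothing left to remove.

```java
import java.io.IOException;
import java.nio.file.*;

public class SecondPassCheck {
    public static void main(String[] args) throws IOException {
        // e.g. java SecondPassCheck ciaFactbook2008.txt.hf
        byte[] data = Files.readAllBytes(Path.of(args[0]));
        long[] counts = new long[256];
        for (byte b : data) counts[b & 0xFF]++;
        double h = 0;
        for (long c : counts) {
            if (c == 0) continue;
            double p = (double) c / data.length;
            h -= p * Math.log(p) / Math.log(2);   // entropy in bits
        }
        System.out.printf("entropy: %.3f bits/byte (8.0 = incompressible)%n", h);
    }
}
```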
Benchmarking Results:
| Files/Directories | Total Bytes Read | Total Compressed Bytes | Total % Compression |
|---|---|---|---|
| smallTxt.txt | 26 | 1044 | -3915.385 |
| calgary | 3251493 | 1845571 | 43.239 |
| BooksANDHTML | 9788581 | 5828112 | 40.460 |
| waterloo | 12466304 | 10205282 | 18.137 |
| ciaFactbook2008.txt | 15666027 | 11988591 | 23.474 |
| ciaFactbook2008.txt.hf | 6028799 | 5959792 | 1.145 |
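Each percentage above is 100 * (Total Bytes Read - Total Compressed Bytes) / Total Bytes Read; for calgary, 100 * (3251493 - 1845571) / 3251493 ≈ 43.239.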