# corpuscompression **Repository Path**: mirrors_spullara/corpuscompression ## Basic Information - **Project Name**: corpuscompression - **Description**: Achieve better compression for small objects with a predefined corpus - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2020-09-26 - **Last Updated**: 2026-03-22 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README Copyright 2011 Sam Pullara Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. The GZIP algorithm uses a corpus built up from the file that you are compressing. If you are trying to compress very small objects, you have the issue that overhead and how little it knows working against you. This library lets you predefine a corpus for compression and decompression which can give you the benefits of a custom compression algorithm without the effort. There are two test examples: Corpus: this is a test of the emergency broadcast system. this is only a test. Data: testing... emergency! but I'm broadcasting. Result: 32 characters compressed vs 43 uncompressed Corpus: links.txt Data: http://twitpic.com/7i7o21 Result: 12 characters compressed vs 25 uncompressed vs 45 compressed using GZIP directly Corpus: links.txt Data: http://oink.com/i/4eb383adee44744a9c000d0c Result: 58 characters compressed vs 67 uncompressed vs 86 compressed using GZIP directly Corpus: tweets.txt Data: targettweet.txt Result: 390 characters compressed vs 1990 uncompressed vs 891 compressed using GZIP directly I think there is a big gap between the ultimate utility and the current random corpus of URLs that I grabbed. Optimizing and evolving that corpus over time would likely be necessary.