FedoraForum.org > The Community Lounge > Programming

#1
2009-11-03, 09:12 PM CST
kramulous
Registered User
Join Date: Dec 2007
Location: Brisbane, Australia
Posts: 45
C++: Reading archives inside archives

Hi,

I currently have about 9000 zip files. Each of those contains roughly 5000-10000 zip files, and each of those in turn contains about 1000 csv files.

I need to strip-mine some data from those csv files, preferably using C++. I don't want to write anything to disk because that may take a while.

Currently, I'm using libzip.

Code (dodgy hack) so far:

Code:
#include <cerrno>
#include <cstdio>
#include <string>
#include <zip.h>

using std::string;

// Opens the master zip file and reads the bytes of its entries
// (each entry is expected to be another zip file).
int readMasterZipFile(const string &zipFile)
{
    struct zip *z;
    struct zip_file *zf;
    int numberOfFiles;
    int n;
    int err;
    char errstr[1024];
    char b[8192];

    if ((z = zip_open(zipFile.c_str(), 0, &err)) == NULL)
    {
        zip_error_to_str(errstr, sizeof(errstr), err, errno);
        fprintf(stderr, "Cannot open zip archive `%s': %s\n", zipFile.c_str(), errstr);
        return -1;
    }

    numberOfFiles = zip_get_num_files(z);
    fprintf(stdout, "The number of files in zip = %d\n", numberOfFiles);

    // Stop testing with all the files - just one for now
    int test = 1; // numberOfFiles;
    for (int j = 0; j < test && j < numberOfFiles; j++)
    {
        // Open the entry with flags 0 so that zip_fread returns decompressed
        // bytes; ZIP_FL_COMPRESSED would return the raw compressed data.
        zf = zip_fopen_index(z, j, 0);
        if (zf == NULL)
        {
            fprintf(stderr, "Cannot open entry %d: %s\n", j, zip_strerror(z));
            continue;
        }

        while ((n = zip_fread(zf, b, sizeof(b))) > 0)
        {
            // TODO: b[0..n) holds bytes of the inner zip file, which still
            // has to be opened as an archive itself
        }
        zip_fclose(zf);
    }
    // Close this zip file
    zip_close(z);
    // For now, useless return
    return 1;
}
My problem is that I need to read the archive within the archive. I can read csv files straight out of a single zip.

Can anybody offer any solutions/ideas/alternatives?
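
One way to keep the inner archive off disk is to drain each entry into a memory buffer before doing anything else with it. The following is only a minimal sketch under assumptions: `ReadFn` and `readAllToMemory` are hypothetical names, not part of libzip, and the reader callable is assumed to follow the same return convention as `zip_fread` (bytes read, 0 at end of data, negative on error).

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical reader type with zip_fread-like semantics: fills buf with up
// to len bytes and returns the number of bytes read, 0 at end of data, or a
// negative value on error.
using ReadFn = std::function<long long(char *buf, std::size_t len)>;

// Drains the reader into a memory buffer so that nothing is written to disk.
std::vector<char> readAllToMemory(const ReadFn &read)
{
    std::vector<char> data;
    char chunk[8192];
    long long n;
    while ((n = read(chunk, sizeof(chunk))) > 0)
    {
        // Append only the bytes that were actually read in this round.
        data.insert(data.end(), chunk, chunk + n);
    }
    if (n < 0)
    {
        throw std::runtime_error("read failed");
    }
    return data;
}
```

With libzip, the callable would presumably wrap `zip_fread(zf, buf, len)` on an entry opened with flags 0. Newer libzip releases also provide `zip_source_buffer` and `zip_open_from_source`, which may allow such a buffer to be opened as an archive in its own right; check the documentation of your installed version for the exact signatures.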
#2
2009-11-03, 10:17 PM CST
jpollard
Registered User
Join Date: Aug 2009
Location: Waldorf, Maryland
Posts: 303
You might consider reading the outer file to extract a single inner zip file, then sending that zip file through a pipe to another process that handles it.

If you are collating the extracted data, you could have all of those processes send their output to a single collection process.

It is clumsy, but it is recursive and could process nearly anything you want. The top-level process would start the collection process and then begin processing either a pipe or a file (specified on the command line).

You may be able to do this with threads or other mechanisms, but I think they would quickly get too complex to maintain (though they would execute faster).
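
The pipe idea can be sketched roughly as follows, assuming a POSIX system. `processInChild` is a hypothetical helper, and the child here only counts the bytes it receives as a stand-in for real processing of the extracted zip data.

```cpp
#include <cstddef>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Sends `payload` to a forked child through a pipe. The child counts the
// bytes it receives (a placeholder for "process the inner zip") and reports
// the result through its exit status. Returns true if the child saw every byte.
bool processInChild(const std::string &payload)
{
    int fds[2];
    if (pipe(fds) != 0)
        return false;

    pid_t pid = fork();
    if (pid < 0)
    {
        close(fds[0]);
        close(fds[1]);
        return false;
    }

    if (pid == 0)
    {
        // Child: read from the pipe until the writer closes its end.
        close(fds[1]);
        char buf[4096];
        std::size_t total = 0;
        ssize_t n;
        while ((n = read(fds[0], buf, sizeof(buf))) > 0)
            total += static_cast<std::size_t>(n);
        close(fds[0]);
        _exit(total == payload.size() ? 0 : 1);
    }

    // Parent: write everything, then close so the child sees end-of-file.
    close(fds[0]);
    const char *p = payload.data();
    std::size_t left = payload.size();
    while (left > 0)
    {
        ssize_t w = write(fds[1], p, left);
        if (w <= 0)
            break;
        p += w;
        left -= static_cast<std::size_t>(w);
    }
    close(fds[1]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return false;
    return left == 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

In a real tool the child would exec a separate program, or call the zip-reading routine on its end of the pipe, and forward its results to the collection process instead of just counting bytes.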
#3
2009-11-04, 03:14 PM CST
kramulous
Registered User
Join Date: Dec 2007
Location: Brisbane, Australia
Posts: 45
OK, I got it. I changed to Java, primarily because of the ZipInputStream class.

It is not elegant, but it does the job. Now to find out how it goes performance-wise. If anybody can spot some crappy memory management here, please point it out. I haven't kept up to date with Java.

Code:
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

public void read2NestedZipFile(String fileName)
    {
        // Open the encapsulating zipfile; try-with-resources closes it even
        // if an exception is thrown part-way through
        try (ZipFile zipFile = new ZipFile(fileName))
        {
            // Process each file found inside it
            for (Enumeration<? extends ZipEntry> e = zipFile.entries(); e.hasMoreElements();)
            {
                // Grab a reference to the file inside the master zip file
                ZipEntry zipEntry = e.nextElement();
                // use ZipInputStream to process this file - assume another zip file for now
                try (ZipInputStream zipReader = new ZipInputStream(zipFile.getInputStream(zipEntry)))
                {
                    ZipEntry anotherZipEntry;
                    while ((anotherZipEntry = zipReader.getNextEntry()) != null)
                    {
                        // Got the file!
                        System.out.println(anotherZipEntry.getName());
                        // TODO: Process the CSV by reading from zipReader
                        // until read() returns -1
                    }
                }
            }
        } catch (IOException ioe)
        {
            System.out.println("An IOException occurred: " + ioe.getMessage());
        }
    }
I would still like to see a C++ version someday. I just didn't have the time to search for one or write one.
#4
2009-11-13, 05:28 PM CST
kramulous
Registered User
Join Date: Dec 2007
Location: Brisbane, Australia
Posts: 45
Just an update in case anybody follows this path.

I ditched the Java solution. It inspected the inner zip files perfectly, but the text processing was unacceptably slow.

I went back to the original C++ implementation, but unpacked the top-level zip file onto a RAM drive. Performance was acceptable. I fear it won't scale well, though. For this problem, it's fine.

I also found that a compression format other than zip would have been much better. The random-access nature of the zip file meant that the whole thing had to be read before processing could begin. If it were .tar.gz or something similar, I could have processed the stream directly as it was read off the disk(s).
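
To illustrate the random-access point: a zip file keeps its index (the central directory) at the end, and the record that points to it is the last structure in the file, so a reader normally needs the tail of the archive before it can list anything. The sketch below is illustrative only. `EndOfCentralDir` and `findEndOfCentralDir` are hypothetical names, the whole archive is assumed to be in memory already, and Zip64 archives are not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fields of interest from the zip "end of central directory" record.
struct EndOfCentralDir
{
    bool found;
    std::uint16_t totalEntries;     // total number of entries in the archive
    std::uint32_t centralDirOffset; // byte offset where the central directory starts
};

// Little-endian decoding helpers.
static std::uint16_t le16(const unsigned char *p)
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

static std::uint32_t le32(const unsigned char *p)
{
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Scans backwards from the end of the buffer for the record signature
// "PK\x05\x06". The record is at least 22 bytes long and may be followed
// by a trailing archive comment.
EndOfCentralDir findEndOfCentralDir(const std::vector<unsigned char> &zip)
{
    EndOfCentralDir r = {false, 0, 0};
    const std::size_t kMinSize = 22;
    if (zip.size() < kMinSize)
        return r;

    for (std::size_t i = zip.size() - kMinSize + 1; i-- > 0;)
    {
        const unsigned char *p = &zip[i];
        if (p[0] == 'P' && p[1] == 'K' && p[2] == 5 && p[3] == 6)
        {
            r.found = true;
            r.totalEntries = le16(p + 10);
            r.centralDirOffset = le32(p + 16);
            return r;
        }
    }
    return r;
}
```

Because the search starts from the last bytes, none of this can happen until the end of the archive is available. A .tar.gz stream, by contrast, places each member's header in front of its data, which is why it can be processed as it arrives.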
Tags
c++ libzip zip-of-zips
