
View Full Version : open big files in C++!!


fizy
10th May 2008, 11:41 PM
I need to test a file transfer with very large files.
I have no problems with small files, where the file's size is under 2GB. The limit comes from the integer types on 32-bit environments: with a signed 32-bit integer the maximum size is ((2^32)/2)-1, just under 2GB, and with an unsigned one it is (2^32)-1, just under 4GB.
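
To make that arithmetic concrete, here is a minimal sketch. The function names are made up for illustration; only the numeric limits come from the standard library.

```cpp
#include <cstdint>
#include <limits>

// Largest offset a signed 32-bit integer can hold: (2^32)/2 - 1 bytes.
long long max_signed32_offset() {
    return std::numeric_limits<std::int32_t>::max();
}

// Largest offset an unsigned 32-bit integer can hold: 2^32 - 1 bytes.
long long max_unsigned32_offset() {
    return std::numeric_limits<std::uint32_t>::max();
}

// True when a file of `size` bytes fits in a signed 32-bit offset.
bool fits_in_signed32(long long size) {
    return size >= 0 && size <= max_signed32_offset();
}
```

A 6.5 GB image fails `fits_in_signed32`, which is why a 64-bit offset type is needed for it.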

I've tried to open a big dual-layer DVD ISO, about 6.5 GB!
I searched with Google and wrote this code:

#include <stdio.h>

void test()
{
    FILE *f=fopen64("/somewhere/file.iso", "rb");
    off64_t file_size=getfilesize(f);
    fclose(f);
}

off64_t getfilesize(FILE *f)
{
    off64_t result = -1;
    off64_t p=ftello64(f);//get the current position
    if(fseeko64(f, 0, SEEK_END)==0)//go on the end
    {
        result=ftello64(f);//get current position=file size!
        fseeko64(f, p, SEEK_SET);//set back old position
    }
    return result;
}

PS: off64_t is defined in stdio.h as "long long"... a 64-bit integer! However, this code works only in my dreams: I've looked with the debugger and the value in file_size is wrong, even for small files :(

majikthise
11th May 2008, 03:35 AM
Are you attempting to load the whole image into memory? Cos that's kinda dumb.

unsigned long long would be preferable.

Just my 2 cents.

majikthise
11th May 2008, 03:49 AM

BTW have you #defined __USE_LARGEFILE64 ?

stevea
11th May 2008, 06:50 AM
Your example doesn't compile - there are errors, so you've left something out.

This revision works but it's fugly as sin.

#include <stdio.h>
#include <unistd.h>

off64_t getfilesize(FILE *f)
{
    off64_t result = -1;
    off64_t p=ftello64(f);//get the current position

    if(fseeko64(f, 0, SEEK_END)==0)//go on the end
    {
        result=ftello64(f);//get current position=file size!
        fseeko64(f, p, SEEK_SET);//set back old position
    }
    return result;
}

void test(char *fname)
{
    FILE *f = fopen64(fname, "r");
    if (f == NULL) // check the return code before using the stream
    {
        perror(fname);
        return;
    }
    off64_t file_size = getfilesize(f);
    fclose(f);
    // cast: off64_t may be plain long, and %lld expects long long
    printf("file_size = %lld.\n", (long long)file_size);
}

int
main(int argc, char *argv[])
{
    if (argc < 2)
    {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    test(argv[1]);
    return 0;
}

For one thing you should always check return codes - or at least for dubious operations.

$ g++ test.cc -o test
$ ./test Fedora-9-Alpha-x86_64-DVD.iso
file_size = 4206884864.




Which is correct.

======
This is a better approach for a 32bit architecture:
#define _LARGEFILE_SOURCE 1
#define _LARGEFILE64_SOURCE 1
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>


off_t getfilesize(FILE *f)
{
    struct stat xstat;
    if (fstat(fileno(f), &xstat) != 0) // check the return code
        return -1;
    return xstat.st_size;
}

void test(char *fname)
{
    FILE *f = fopen(fname, "r");
    if (f == NULL)
    {
        perror(fname);
        return;
    }
    off_t file_size = getfilesize(f);
    fclose(f);
    printf("file_size = %lld.\n", (long long)file_size);
}

int
main(int argc, char *argv[])
{
    // sizeof yields size_t, so cast it to match %d
    printf("sizeof(off_t) = %d.\n", (int)sizeof(off_t));
    if (argc < 2)
        return 1;
    test(argv[1]);
    return 0;
}

Firewing1
11th May 2008, 07:34 AM
You should use a buffer so that you can read the large file in smaller chunks.

copy.cpp:
#include <fstream>
#include <iostream>
#include <stdio.h>
#include <string.h>
#include "copy.h"
using namespace std;

int main(int argc, char **argv) {
    if (argc < 3) {
        printf("usage: %s source dest\n", argv[0]);
        return STATUS_ERROR;
    }
    add_file(argv[1], argv[2]);
    return 0;
}

int add_file(const char *src,
             const char *dst) {
    /* Adds a single file */
    ifstream infile;
    ofstream outfile;
    char *buffer;
    int src_size;
    int status;
    int count;
    printf("Copying %s to %s\n", src, dst);
    count = 0;
    status = STATUS_OK;
    // open at end so tellg() gets file size
    infile.open(src, ios::binary|ios::ate);
    outfile.open(dst, ios::binary);
    src_size = infile.tellg();
    // seek to beginning since we opened at end
    infile.seekg (0, ios::beg);
    //printf("Source size: %d\n", src_size);
    buffer = new char[BUFFER_SIZE];
    // good() is false if the open failed, so a missing source can't loop forever
    while ( (status == STATUS_OK) && infile.good() ) {
        // reads up to BUFFER_SIZE bytes of data into buffer
        infile.read(buffer, BUFFER_SIZE);
        outfile.write(buffer, infile.gcount());
        count += infile.gcount();
        //printf("Current progress: %dMB\n", (count/1024/1024));
    }
    infile.close();
    outfile.close();
    // deletes buffer and its contents
    delete[] buffer;
    return(status);
}
copy.h:
#define BUFFER_SIZE 32768

#define STATUS_OK 0
#define STATUS_ERROR 1
#define STATUS_NO_DISK_SPACE 2


int add_file(const char *src, const char *dst);

You can now compile the program using
g++ -o copy copy.cpp

"copy" takes two arguments, source and destination. Note that "./copy source ." doesn't work; you have to actually provide a filename, i.e. "./copy source ./dest". There's definitely room for improvement, but that's my first C++ program :)

Firewing1

fizy
11th May 2008, 11:42 AM
Thank you majikthise, stevea and Firewing1... but there is something strange going on here... lol

Well, of course I'm not going to load the full file (6.5GB!) into memory. I'll read small pieces of 128KB or 256KB (or a custom size) and then send them from my socket client... the server will just append all the incoming pieces to a file, so I can reassemble it and get a copy of the original file...

Your code, Firewing1, is good, but of course it will work only with small files under 2GB. You are using ifstream and ofstream... they support only small files, as they (like you, with the "int count;" variable) use a signed 32-bit int to manage the file... and yes, this is my problem: I need to deal with files over 4GB (more than an unsigned 32-bit int can hold). So my code needs a signed or unsigned 64-bit int ("long long" or "unsigned long long").
A signed 64-bit int isn't bad, because I can use "-1" as a flag for an "invalid size".
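
The chunked plan described above could be sketched like this. It is only an illustration: `send_in_chunks` and the `sink` callback are hypothetical names, and `sink` stands in for the real socket send, which the thread does not show. The point is the `long long` running total, which does not wrap at 2GB the way an `int` counter would.

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

// Reads `path` in pieces of at most `chunk_size` bytes and hands each piece
// to `sink` (hypothetical callback; in the real program this would be the
// socket send). Returns the total number of bytes read, or -1 if the file
// cannot be opened.
template <typename Sink>
long long send_in_chunks(const char* path, std::size_t chunk_size, Sink sink) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return -1;

    std::vector<char> buffer(chunk_size);
    long long total = 0;  // 64-bit counter, so it does not wrap at 2GB
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        std::streamsize got = in.gcount();  // bytes actually read this round
        if (got <= 0) break;
        sink(buffer.data(), static_cast<std::size_t>(got));
        total += got;
    }
    return total;
}
```

Called with a 128KB chunk size, the last piece is simply shorter than the others; the receiver appends pieces in arrival order.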

Thanks to stevea, I've looked at the file "sys/stat.h"... this include is new to me :D

I've compiled the code under Linux, and now it works fine (maybe thanks to the #defines?)

I have no idea! If I run your code (both versions) under Windows I get the same output:
sizeof(off_t) = 4.
file_size = -1803255808. /*What the heck!?!?lol */
Under linux both versions work:
sizeof(off_t) = 8.
file_size = 6786678784. /*YEAHHH :D */

I don't understand why it doesn't work under Windows (I use mingw)!

How can I write this as portable code?
Btw, to read from and write to the file, can I use fread and fwrite as usual?
What other interesting functions are there in sys/stat.h? :D

majikthise
11th May 2008, 12:05 PM
"sizeof(off_t) = 4.
file_size = -1803255808. /*What the heck!?!?lol */"
If it were an unsigned value, it would be about 2.49GB (2491711488 bytes), which is still way off the mark. :confused:

"Under linux both versions work:
sizeof(off_t) = 8.
file_size = 6786678784. /*YEAHHH :D */

I don't understand why it doesn't work under Windows (I use mingw)!"
I didn't know mingw was 64bit capable. Although you would need to be using 64bit Windows OS too.
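
The two numbers in the quoted output are consistent with a 64-bit size being cut down to its low 32 bits somewhere along the way. A small sketch of that arithmetic follows; the helper names are invented for illustration, and it shows the truncation only, not where in the Windows build it happens.

```cpp
#include <cstdint>

// Keeps only the low 32 bits of a 64-bit size, as an unsigned value.
std::uint32_t low32_unsigned(long long size) {
    return static_cast<std::uint32_t>(static_cast<unsigned long long>(size));
}

// The same 32 bits reinterpreted as a signed value
// (two's complement assumed, which holds on x86).
std::int32_t low32_signed(long long size) {
    return static_cast<std::int32_t>(low32_unsigned(size));
}
```

For the 6786678784-byte file, `low32_unsigned` gives 2491711488 (6786678784 minus 2^32) and `low32_signed` gives -1803255808, the value printed on Windows.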

fizy
11th May 2008, 12:40 PM
"If it were an unsigned value, it would be about 2.49GB (2491711488 bytes), which is still way off the mark. :confused:"
off_t is 32 bits in mingw (under Windows). But I replaced off_t with long long, so even though it prints "sizeof(off_t) = 4.", I was using 64-bit integers in my functions :cool: and off64_t is fixed at 64 bits on Windows too!

"I didn't know mingw was 64bit capable. Although you would need to be using 64bit Windows OS too."
What!?!? :eek:
No no, you can use 64-bit integers under Windows too. You don't need a 64-bit OS to work with 64-bit numbers ;) . It's the same way I'm using 64-bit integers under a 32-bit Fedora OS.

I just don't know why the same code works under Linux but not under Windows... if the code is the same, I would expect the same result :rolleyes:

majikthise
11th May 2008, 12:57 PM
Ints are dependent on the natural size of the cpu which in turn is also dependent on the host OS. In this case (mingw) is 32bits (i.e. sizeof (off_t) =4) - Don't take my word for it check any good book on C/C++ ;)

Maybe you're missing a compiler switch for building 64bit binaries, which *will* give you 64bit ints

fizy
11th May 2008, 01:15 PM
"Ints are dependent on the natural size of the cpu which in turn is also dependent on the host OS. In this case (mingw) is 32bits (i.e. sizeof (off_t) =4) - Don't take my word for it check any good book on C/C++ ;)

Maybe you're missing a compiler switch for building 64bit binaries, which *will* give you 64bit ints"
lol, I can answer you, but I hope somebody can and will answer my questions too! :D :D

BTW:
Yes, "int"s are (usually) based on the CPU's working word (for example 8086 -> AX/DI/SI/BX... -> 16 bit, i386 -> EAX/EDX/ECX/EBX/EIP/... -> 32 bit, etc.).
I don't need to build "64bit binaries". I'm okay with 32-bit binaries at the moment :D .
So, even if "int"s are 32 bits by default on my Windows or Linux OS, I can just use "long long int" to get 64-bit integers on both Windows and Fedora :)
What would be the difference between "long int" and "long long int" if their size were the same? :D
Take this as law: int is never under 16 bits, long int is never under 32 bits, long long int is never under 64 bits. The CPU architecture doesn't matter. :cool:
(Of course, if you are using some specific microcontroller or some exotic microprocessor the situation may be different...)
So I don't need 64-bit ints... I already have 64-bit numbers!! They are called "long long" ;) ... I just need to understand why this function to open large files doesn't work in Windows, even though the same code works on my "32bit Fedora GNU/Linux"... :)
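
The minimum-width "law" above can be checked at compile time. This is a small sketch assuming a C++11 compiler for `static_assert`; the `bits_of_*` helpers are illustrative names, not part of any library.

```cpp
#include <climits>

// Width in bits of each integer type on the platform compiling this code.
int bits_of_int()       { return static_cast<int>(sizeof(int) * CHAR_BIT); }
int bits_of_long()      { return static_cast<int>(sizeof(long) * CHAR_BIT); }
int bits_of_long_long() { return static_cast<int>(sizeof(long long) * CHAR_BIT); }

// Compile-time checks of the minimum widths the C and C++ standards guarantee.
static_assert(sizeof(int) * CHAR_BIT >= 16, "int has at least 16 bits");
static_assert(sizeof(long) * CHAR_BIT >= 32, "long has at least 32 bits");
static_assert(sizeof(long long) * CHAR_BIT >= 64, "long long has at least 64 bits");
```

These are lower bounds only: `long` may be 32 bits on one system and 64 on another, which is why `long long` (or a fixed-width type from `<cstdint>`) is the safer choice for file sizes.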

majikthise
11th May 2008, 03:38 PM
Well you seem to have all the answers. It's strange that you understand the problem so well, that you don't know how to fix it. :D

fizy
11th May 2008, 04:11 PM
"Well you seem to have all the answers. It's strange that you understand the problem so well, that you don't know how to fix it. :D"
Yes, that's crazy.

I ran this code:
// sizeof yields size_t, so each value is cast to int to match %d
printf("sizeof(int) = %d.\n", (int)sizeof(int));
printf("sizeof(long) = %d.\n", (int)sizeof(long));
printf("sizeof(long long) = %d.\n", (int)sizeof(long long));
printf("sizeof(off_t) = %d.\n", (int)sizeof(off_t));
printf("sizeof(off64_t) = %d.\n", (int)sizeof(off64_t));
On windows the output is
sizeof(int) = 4.
sizeof(long) = 4.
sizeof(long long) = 8.
sizeof(off_t) = 4. <---the only difference
sizeof(off64_t) = 8.
On linux the output is
sizeof(int) = 4.
sizeof(long) = 4.
sizeof(long long) = 8.
sizeof(off_t) = 8. <---the only difference
sizeof(off64_t) = 8.

The source compiles fine on Windows and Linux... I just don't understand why it doesn't work on Windows...
I hope somebody can help me, I'm stuck on this problem... lol

BTW, I've thought of a workaround (not a solution, but a trick :D )

---header:

#ifdef WIN32
#include <windows.h>
#else
#define _LARGEFILE_SOURCE 1
#define _LARGEFILE64_SOURCE 1
#define _FILE_OFFSET_BITS 64
#define __USE_LARGEFILE64 1
#include <sys/types.h>
#include <sys/stat.h>
#endif

#include <iostream>
#include <stdio.h>
#include <unistd.h>

using namespace std;

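class CLargeFile
{
private:
#ifdef WIN32
    HANDLE f_File;//if windows
#else
    FILE* f_File;//if not
#endif
public:
    // start with "no file open" so Close() in the destructor is always safe
#ifdef WIN32
    CLargeFile() : f_File(INVALID_HANDLE_VALUE) {}
#else
    CLargeFile() : f_File(NULL) {}
#endif
    virtual ~CLargeFile()
    {
        Close();
    }
    bool Open(const char* filename, const char* mode);
    off64_t Size();
    int Close();
};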

---source file:
#include ~header~

off64_t CLargeFile::Size()
{
#ifdef WIN32
    /*
    backup position
    seek on the end
    get the offset (=size)
    restore back old position
    */
    return -1; // not implemented yet
#else
    struct stat xstat;
    if(f_File==NULL || fstat(fileno(f_File), &xstat)!=0) return -1;
    return xstat.st_size;
#endif
}

int CLargeFile::Close()
{
    int result = 0;
#ifdef WIN32
    if(f_File!=INVALID_HANDLE_VALUE)
    {
        result = CloseHandle(f_File) ? 0 : 1; // CloseHandle returns nonzero on success
        f_File = INVALID_HANDLE_VALUE;
    }
#else
    if(f_File!=NULL)
    {
        result = fclose(f_File);
        f_File = NULL;
    }
#endif
    return result;
}

bool CLargeFile::Open(const char* filename, const char* mode)
{
#ifdef WIN32
    DWORD openflags = 0;
    //I need to parse the string "mode" to extract the flags.
    // openflags |= READ_MODE;
    // openflags |= WRITE_MODE;
    // etc....
    f_File=CreateFile(filename,......);
    return f_File!=INVALID_HANDLE_VALUE;
#else
    // assign the member; declaring "FILE *f_File" here would create a local that hides it
    f_File = fopen64(filename, mode);
    return f_File!=0;
#endif
}

........it's so boring to write all that just to get the code working on Windows !! :mad:
This is why I'd prefer to understand why the "standard C" code doesn't work on Windows when it does on Linux... it may be interesting as well. :)
Mirroring every function in a class for each OS is pretty long work... it will work for sure, but I'd prefer to understand why the other "normal" method doesn't work :(

EDIT: majikthise, note that the problem is not the size of the variables, as I'm always using off64_t (which, as you can see, is 64 bits on *all* systems, Windows included). The problem is somewhere else... not in the variables' size. :cool:

Firewing1
11th May 2008, 04:55 PM
iostream/fstream has no limit on file size - they use an internal buffer. If you'd like to send files to a server through a buffer, just remove the calls that deal with "count" (the program won't be able to print statistics, but it will still work) and replace the outfile.write with a write to the server socket. You should be able to implement the same code in reverse, meaning receiving from the socket and writing to a file instead of reading a file and writing to the socket.
Firewing1

fizy
11th May 2008, 06:58 PM
"iostream/fstream has no limit on file size - they use an internal buffer. If you'd like to send files to a server through a buffer, just remove the calls that deal with "count" (the program won't be able to print statistics, but it will still work) and replace the outfile.write with a write to the server socket. You should be able to implement the same code in reverse, meaning receiving from the socket and writing to a file instead of reading a file and writing to the socket.
Firewing1"
So, are you telling me that I can use fstat to get the file size and then use ifstream to read from it (and ofstream for writing)? :rolleyes: I wonder if it really works... and I won't be able to seek in the file anyway, right? :rolleyes: (well, at the moment I don't need that...)

thanks :)

Firewing1
11th May 2008, 07:49 PM
Yup, you can get the file size, read, write and seek. See here for more information:
http://www.cplusplus.com/reference/iostream/ofstream/
Firewing1

brunson
11th May 2008, 08:03 PM
So, you've pretty much covered linear access, appending to the end and stat'ing the file. If you want random access to the file, you could consider mmap()ing it.

fizy
11th May 2008, 08:14 PM
"Yup, you can get the file size, read, write and seek. See here for more information:
http://www.cplusplus.com/reference/iostream/ofstream/
Firewing1"
Yes, I knew that website, and I knew I can get the file size and seek within these streams... but do they work with int64!??? What switch (#define) do I need to declare to make fstream work with int64? By default fstreams work with 32-bit integers, not int64!
I can't get the file size while they are working with integers as small as 32 bits......

Firewing1
11th May 2008, 09:46 PM
I'm not sure what you mean. If you need to seek in large files, simply use int64 or long long as your counter variable... iostreams use buffers, so 32-bit integers won't matter since they only read small pieces at a time (even if you used the entire int32 - Reading 2GB into memory is definitely not a good idea).
Edit: Either way, this was fixed a long time ago - see this (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=8610) bug
Firewing1
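
For what it's worth, stream positions convert to `std::streamoff`, so the size can be read through an ifstream and stored in a `long long`. The sketch below uses made-up function names; whether `std::streamoff` is actually 64 bits wide depends on the standard library build, which is what the second helper reports.

```cpp
#include <fstream>

// Returns the size of `path` via an ifstream, or -1 on failure.
long long stream_file_size(const char* path) {
    // ios::ate opens the file positioned at the end, so tellg() is the size
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return -1;
    std::streamoff end = in.tellg();  // streamoff is a signed integer type
    return static_cast<long long>(end);
}

// True when this library's stream offsets are wide enough for files over 4GB.
bool streams_have_64bit_offsets() {
    return sizeof(std::streamoff) >= 8;
}
```

If `streams_have_64bit_offsets()` is false on some platform, the fstat-style approach from earlier in the thread is the fallback.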

fizy
11th May 2008, 10:58 PM
okay!!!!!!!!!!!!!!! :D :D :D :D

Now I have another question: space pre-allocation for large files. Is it good or bad?
I'm not using it at the moment. Will space pre-allocation reduce disk fragmentation, or will it cause more problems?
Maybe pre-allocation is good only for clients (download) and a problem for the server (it will keep the disk busy) whenever a client is uploading a big file? :rolleyes:
Anyway, how can I create a big file (filled with '\0' or with random data... that's not important!)? Is there a ready-made way to pre-allocate space for large files?
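
One simple way to create a file of a given size is to seek to the last byte and write a single '\0'. The sketch below does that with an ofstream; `create_file_of_size` is an illustrative name. Note the assumption: on many filesystems this produces a sparse file, so it sets the size without necessarily reserving disk blocks. On POSIX systems `posix_fallocate` is the call that asks for the blocks to be reserved.

```cpp
#include <fstream>

// Creates (or truncates) `path` and extends it to `size` bytes by seeking to
// the last byte and writing a single '\0'. Returns true on success.
bool create_file_of_size(const char* path, long long size) {
    if (size <= 0) return false;
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.seekp(static_cast<std::streamoff>(size - 1), std::ios::beg);
    out.put('\0');
    out.close();
    return !out.fail();
}
```

The receiving side can then open the file for update and write each incoming piece at its own offset.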

*Ah, this was a pending question :p -- but an important one!
--> What other interesting functions are there in sys/stat.h?
I've never learned the includes under sys/*, I only know the standard C++ includes (lists, vectors, maps, ...), but nothing about the sys includes. Where can I find a good, free, up-to-date manual about them? Any suggestions? (or google? :D )

cable_txg
12th May 2008, 02:52 AM
(even if you used the entire int32 - Reading 2GB into memory is definitely not a good idea)

:) Some "Professional" programmers do not seem to understand this concept and think that because you have 16GB of memory on a server, why not load 3GB of data immediately into a buffer. :( I'd rather stay a "Programmer" than a "Senior Developer" .....

:) Just me ranting.... It's good to read topics that validate my core programming knowledge.... Keep up the good works guys.... :D

majikthise
12th May 2008, 03:57 AM
Hey, bugmenot_4_life, :)
I only just noticed that your code doesn't #include <sys/statfs.h>
which #includes <bits/statfs.h>

jbannon
17th May 2008, 02:40 AM
hint: use boost!

fizy
17th May 2008, 10:13 AM
"hint: use boost!"
eheh :D , it's a nice collection of libraries, but in this context using boost would be a "cheat", as I had to write the software on my own (limiting library usage as much as possible). :cool:

But the suggestion itself is excellent :D