| Programming & Packaging A place to discuss programming and packaging. |

26th July 2007, 02:16 AM
|
|
Registered User
|
|
Join Date: Mar 2005
Posts: 83

|
|
|
join all the lines between any two "$" in a text file
I want to join all the lines between any two "$" in a text file. The number of lines between two "$" varies.
What script will do that?
Thanks.
e.g.
input:
Apr. 07, 2007
STATIONNEMENT DE MONTR
$8.50
Apr. 27, 2007 PRE-AUTHORIZED PAYMENT
$794.86
Jun. 25, 2007 THE HOME DEPOT 7026 $37.56
output:
Apr. 07, 2007 STATIONNEMENT DE MONTR $8.50
Apr. 27, 2007 PRE-AUTHORIZED PAYMENT $794.86
Jun. 25, 2007 THE HOME DEPOT 7026 $37.56
|

26th July 2007, 06:21 AM
|
 |
Registered User
|
|
Join Date: Jul 2007
Location: Calgary AB
Posts: 34

|
|
Might be overkill, and I pity you if you need to parse a big file, but it works with your example (with an additional space at the beginning of every line, but that's trivial to remove in postprocessing).
Code:
#!/bin/bash
IFS=$'\n'
file=( $(cat test_file) )
joined_line=""
for line in ${file[*]}
do
joined_line="$joined_line $line"
if (( $(expr index '$' "$line" ) != 0 ))
then
echo "$joined_line"
joined_line=""
fi
done
(Who can make a one-liner?)
|

26th July 2007, 09:59 AM
|
 |
Registered User
|
|
Join Date: Nov 2006
Location: Detroit
Posts: 4,621

|
|
Not a one-liner, but it is a few lines shorter, avoids the external call to expr, and removes the extra space at the beginning of each output line:
Code:
#!/bin/bash
joined_line=""
while read line
do
joined_line="${joined_line% } ${line}"
if [[ ${line} =~ .*[\$]+.* ]]; then
echo "${joined_line# }"
joined_line=""
fi
done < input
This assumes the input file is called 'input'.
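For what it's worth, the same loop can be sketched so that it should also run under plain POSIX sh. This is only an illustration, not the script as posted: the function name join_records_sh is made up, it reads standard input instead of a file called 'input', and it uses read -r so backslashes in the data are kept literally. Skipping empty lines and printing a trailing record that has no "$" are assumptions about what is wanted.

```shell
#!/bin/sh
# join_records_sh is a hypothetical name chosen for this sketch.
# Reads stdin and prints one joined line per record, where a record
# ends at the first line that contains a "$".
join_records_sh() {
    joined=""
    # The "|| [ -n ... ]" part also handles a last line with no trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        # Assumption: empty lines carry no data and are skipped
        [ -z "$line" ] && continue
        # Append with a single separating space (none before the first piece)
        joined="${joined:+$joined }$line"
        case $line in
            *'$'*)
                printf '%s\n' "$joined"
                joined=""
                ;;
        esac
    done
    # Assumption: a trailing record without "$" is still printed
    [ -n "$joined" ] && printf '%s\n' "$joined"
    return 0
}
```

It would be called as join_records_sh < input > output. Using case for the match keeps everything inside the shell, with no external call and no bash-only [[ ]].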
|

2nd August 2007, 10:00 AM
|
|
Registered User
|
|
Join Date: Sep 2006
Posts: 52

|
|
Code:
awk '
$0 !~ /^\$/ { printf $0" " }
$0 ~ /^\$/ { print;next}
' file
|

2nd August 2007, 10:22 PM
|
 |
Registered User
|
|
Join Date: Nov 2006
Location: Detroit
Posts: 4,621

|
|
Quote:
|
Originally Posted by ghostdog74
Code:
awk '
$0 !~ /^\$/ { printf $0" " }
$0 ~ /^\$/ { print;next}
' file
|
That prints a double space when it encounters a blank line.
|

3rd August 2007, 02:46 AM
|
|
Registered User
|
|
Join Date: Sep 2006
Posts: 52

|
|
Quote:
|
Originally Posted by RupertPupkin
That prints a double space when it encounters a blank line.
|
Code:
awk '
$0 !~ /^\$/ && $0 !~ /^$/ { printf $0" " }
$0 ~ /^\$/ { print;next}
' file
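One thing to watch with the script above: /^\$/ only matches when the "$" is the first character on the line, so a record like the third one in powah's sample (amount at the end of the same line) is printed without its newline. Also, printf $0" " uses the data as the format string, so a "%" in the input would be interpreted. A minimal sketch that ends a record on a "$" anywhere in the line is below. The wrapper name join_records_awk is invented for illustration, and skipping blank lines and flushing a trailing record are assumptions.

```shell
#!/bin/sh
# join_records_awk is a hypothetical wrapper name, not from the thread.
# Reads stdin; joins lines until one containing "$" has been seen.
join_records_awk() {
    awk '
        NF == 0 { next }                           # skip blank lines (assumption)
        { rec = (rec == "" ? $0 : rec " " $0) }    # append with a single space
        /\$/ { print rec; rec = "" }               # a "$" anywhere ends the record
        END { if (rec != "") print rec }           # flush a trailing partial record
    '
}
```

Usage would be join_records_awk < file. Building the record in a variable instead of printing piece by piece also avoids the trailing space on each output line.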
|

4th August 2007, 03:59 AM
|
 |
Registered User
|
|
Join Date: Nov 2006
Location: Detroit
Posts: 4,621

|
|
OK, I am going to have to take a new look at awk, or at least the GNU version (gawk).
I just ran the scripts by UnFleshedOne, ghostdog74, and me, on a "large" file. What I did was take powah's original data and concatenate it 10,000 times onto itself, producing a 1.3MB input file. I was curious to see how long each script would take. Here are the results:
UnFleshedOne: 5 minutes 33 seconds
me: 11.1 seconds
ghostdog74: 0.36 seconds
I'm not surprised by the results for UnFleshedOne's script, since he calls an external program (cat) to read the entire input file's contents into memory (ouch!) and then re-reads each line of that file's contents into a giant list passed to the for loop, i.e. into memory again (double ouch!). I suspected that my script would be decent but not great, which turned out about right. The real shocker to me was ghostdog74's awk script, doing the whole thing in less than half a second!
I am very surprised by that, because I stopped using awk about 10 years ago precisely because its performance with large files was terrible. I guess things have really changed, at least for gawk. It's super fast now. Out of curiosity, I increased the size of the input file to 130MB. Sure enough, the awk script took only 33.7 seconds to complete! In comparison, UnFleshedOne's script bombed after 3 minutes 10 seconds with a "can not allocate memory" error before it could even write the output, and my script took 18 minutes 21 seconds.
So I have ghostdog74 to thank for making me see the awk light. Luckily I still have my old copy of The Awk Programming Language lying around.
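For anyone who wants to repeat the test, the big input file can be rebuilt roughly as below. This is a sketch, not the exact commands used for the timings above: make_big_input and the file names in the usage note are invented for illustration. It lets awk read the small sample once and write it out N times, which avoids starting cat once per copy.

```shell
#!/bin/sh
# make_big_input is a hypothetical helper name for this sketch.
# $1 = small sample file, $2 = number of copies; result goes to stdout.
make_big_input() {
    awk -v n="$2" '
        { buf = buf $0 "\n" }                            # slurp the small sample once
        END { for (i = 0; i < n; i++) printf "%s", buf } # emit it n times
    ' "$1"
}
```

With the sample saved as sample.txt (an assumed name), make_big_input sample.txt 10000 > input should give the 10,000-copy file, and a larger count gives the bigger one. Each script can then be run under the shell's time.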
|

4th August 2007, 05:58 AM
|
 |
Registered User
|
|
Join Date: Jul 2007
Location: Calgary AB
Posts: 34

|
|
Haha, well, creating a script that can be killed by a mere 100 megs of data is an achievement in itself. And come to think of it, I use these kinds of loops over arrays all over the place (the files are really small, though).
Awk is a true killer indeed.
__________________
Because it is lonely on the top of the food chain.
|

10th August 2007, 06:03 AM
|
 |
Registered User
|
|
Join Date: Nov 2003
Location: Regensburg, Germany
Age: 42
Posts: 447

|
|
|
Anyone mentioning IFS=... is my hero! Cost me a trillion hours to find this one out!
__________________
/(bb|[^b]{2})/ -- that is the question!
Last edited by pigpen; 10th August 2007 at 06:05 AM.
Reason: typo
|

10th August 2007, 07:53 AM
|
 |
Registered User
|
|
Join Date: Feb 2005
Posts: 1,101

|
|
Here's a oneliner
Code:
cat input|tr "\n" " "| sed '/Jan\|Feb\|Mar\|Apr\|May\|Jun\|Jul\|Aug\|Sept\|Oct\|Nov\|Dec/ s//\
&/g'|sed 's/  */ /g'
Not sure if it will work with all your data, but it certainly does for the sample you posted.
I've assumed that all lines will begin with some month. You might have to adjust the month abbreviations and hope that they don't match some other text in your input data.
PS: Remove the spaces between Oct\ and |Nov\. The [CODE] tag seems to be adding them. So, I've attached the script.
Last edited by mndar; 10th August 2007 at 10:19 AM.
|

13th August 2007, 09:46 AM
|
|
Registered User
|
|
Join Date: Sep 2006
Posts: 52

|
|
Quote:
|
Originally Posted by mndar
Here's a oneliner
Code:
cat input|tr "\n" " "| sed '/Jan\|Feb\|Mar\|Apr\|May\|Jun\|Jul\|Aug\|Sept\|Oct\|Nov\|Dec/ s//\
&/g'|sed 's/  */ /g'
|
You don't have to use cat for that:
Code:
tr "\n" " " < input | sed .........
|

13th August 2007, 01:33 PM
|
 |
Retired User
|
|
Join Date: Oct 2004
Location: London, UK
Posts: 4,999

|
|
A little C code compiled with gcc on F7 is 10 times faster than that awk script:
Code:
#include <stdio.h>
#include <string.h>
int main()
{
char buf[256], joined_str[1024];
*joined_str = 0;
while (fgets(buf, 256, stdin) != NULL)
{
*strchr(buf, '\n') = ' ';
strcat(joined_str, buf);
if (strchr(joined_str, '$') != NULL)
{
puts(joined_str);
*joined_str = 0;
}
}
return 0;
}
Code:
gcc -o join_lines join_lines.c
./join_lines < input_file > output_file
(The code is a shoddy snippet and for test purposes only)
|

13th August 2007, 05:30 PM
|
 |
Retired User
|
|
Join Date: Oct 2004
Location: London, UK
Posts: 4,999

|
|
This is probably uber-nerdy, but I was sufficiently interested by RupertPupkin's results to investigate myself. I had also shied away from awk for large processing tasks, thinking it was not up to it. Now, while these results show that a minimal C program can parse the 125MB file in ~3 seconds (!), awk's sub-minute showing is very reasonable, and considering that you would process files of this size quite infrequently, it's definitely worth considering before writing C code. I ran the tests on an AMD64 3500+ with 1 GB of memory. I amended the C code to include basic buffer overflow checks, but (unlike the other three candidates) it will produce incorrect results for unusually long lines in the input file.
This was the result for a 125MB file created by concatenating the original repeatedly (which is not an ideal test file, and may cause abnormal behaviour for the benchmarks):
C code = ~ 3 secs
CPP code = ~ 29 secs
awk script (ghostdog74) = ~53 secs
shell script (RupertPupkin) = ~15 minutes
So, basic C, with minimal error checking clearly rules, but awk's performance is very impressive considering it is a script.
[jb@linux7 test]$ time ./myparse_c < input > outfile
real 0m3.178s
user 0m2.544s
sys 0m0.533s
[jb@linux7 test]$ time ./myparse_cpp < input > outfile
real 0m29.708s
user 0m19.296s
sys 0m9.989s
[jb@linux7 test]$ time awk '$0 !~ /^\$/ && $0 !~ /^$/ { printf $0" " } $0 ~ /^\$/ { print; printf "\n"; next} ' input > outfile
real 0m53.793s
user 0m52.839s
sys 0m0.651s
[jb@linux7 test]$ time while read line; do joined_line="${joined_line% } ${line}"; if [[ ${line} =~ .*[\$]+.* ]]; then echo "${joined_line# }"; joined_line=""; fi; done < input > outfile
real 14m55.591s
user 14m12.884s
sys 0m36.456s
Code:
/**** myparse.c ****/
#include <stdio.h>
#include <string.h>
#define MAX_LENGTH 1024
int main()
{
char buf[256], joined_str[MAX_LENGTH];
*joined_str = 0;
while (fgets(buf, 256, stdin) != NULL)
{
char* p = strchr(buf, '\n');
if ( p != NULL ) *p = ' ';
int i = strlen(joined_str) + strlen(buf);
if (i < MAX_LENGTH) strcat(joined_str, buf);
if (strchr(buf, '$') != NULL)
{
puts(joined_str);
*joined_str = 0;
}
}
return 0;
}
Code:
/**** myparse.cpp ****/
#include <iostream>
#include <string>
using namespace std;
int main()
{
string buf, joined_str("");
while (getline(cin, buf, '\n'))
{
joined_str += buf + " ";
if (buf.find('$') != string::npos)
{
cout << joined_str << "\n";
joined_str = "";
}
}
return 0;
}
|

14th August 2007, 02:41 AM
|
 |
Registered User
|
|
Join Date: Nov 2006
Location: Detroit
Posts: 4,621

|
|
Quote:
|
Originally Posted by sideways
I amended the C code to include basic buffer overflow checks, but (unlike the other three candidates) it will produce incorrect results for unusually long lines in the input file.
|
Well, that is kind of important, don't you think? You're not comparing apples to apples. You need to modify your C code to be able to handle not just the case where there can be more than 256 characters in a line, but also the case where there are more than 1024 characters between dollar sign symbols. Neither the awk nor bash scripts have those limitations, as you noted. In fact, you also need to modify your code to not replace a blank line by a space, which it's doing right now. Those 3 changes probably won't slow down your C program by very much, but they would at least make for a fairer comparison.
By the way, for some reason you added an extra printf statement to the awk script, which may explain why on your faster machine that script ran about 20 seconds slower than on my much slower machine.
I concatenated the original data onto itself 1 million times to create a 125MB (roughly 130,000,000 bytes) file. Here's what I got:
Code:
C code:
time myparse < input > output
real 0m5.981s
user 0m4.422s
sys 0m1.290s
awk script:
time ./joinlines.awk input > output
real 0m32.960s
user 0m31.225s
sys 0m1.322s
So the C program is about 5-6 times faster than the awk script without the necessary changes. I'll be curious to see how much that changes after the C code is modified.
|

14th August 2007, 07:25 AM
|
|
Registered User
|
|
Join Date: Sep 2006
Posts: 52

|
|
|
How about this:
Get the awk source code, extract the part where it does what the C code does, and compile that into a "mini" awk which only performs the same function as the C code. Then we can compare apples to apples.
How does that sound?
|