Fedora Linux Support Community & Resources Center
  #1  
Old 26th July 2007, 02:16 AM
powah Offline
Registered User
 
Join Date: Mar 2005
Posts: 83
join all the lines between any two "$" in a text file

I want to join all the lines between any two "$" in a text file. The number of lines between two "$" varies.
What script will do that?
Thanks.
e.g.
input:
Apr. 07, 2007
STATIONNEMENT DE MONTR
$8.50
Apr. 27, 2007 PRE-AUTHORIZED PAYMENT

$794.86
Jun. 25, 2007 THE HOME DEPOT 7026 $37.56

output:
Apr. 07, 2007 STATIONNEMENT DE MONTR $8.50
Apr. 27, 2007 PRE-AUTHORIZED PAYMENT $794.86
Jun. 25, 2007 THE HOME DEPOT 7026 $37.56
  #2  
Old 26th July 2007, 06:21 AM
UnFleshed One Offline
Registered User
 
Join Date: Jul 2007
Location: Calgary AB
Posts: 34
Might be overkill, and I pity you if you need to parse a big file, but it works with your example (with an additional space at the beginning of every line, but that's trivial to remove in postprocessing).

Code:
#!/bin/bash
IFS=$'\n'
file=( $(cat test_file) )

joined_line=""
for line in ${file[*]}
do
	joined_line="$joined_line $line"
	if (( $(expr index "$line" '$' ) != 0 ))
	then
		echo "$joined_line"
		joined_line=""
	fi
done
(Who can make a one-liner?)
  #3  
Old 26th July 2007, 09:59 AM
RupertPupkin Offline
Registered User
 
Join Date: Nov 2006
Location: Detroit
Posts: 4,621
Not a one-liner, but it is a few lines shorter, avoids the external call to expr, and removes the extra space at the beginning of each output line:
Code:
#!/bin/bash
joined_line=""
while read line
do
   joined_line="${joined_line% } ${line}"
   if [[ ${line} =~ .*[\$]+.* ]]; then
      echo "${joined_line# }"
      joined_line=""
   fi
done < input
This assumes the input file is called 'input'.
  #4  
Old 2nd August 2007, 10:00 AM
ghostdog74 Offline
Registered User
 
Join Date: Sep 2006
Posts: 52
Code:
awk '
        $0 !~ /^\$/ { printf $0" " }
        $0 ~ /^\$/ { print;next}
' file
  #5  
Old 2nd August 2007, 10:22 PM
RupertPupkin Offline
Registered User
 
Join Date: Nov 2006
Location: Detroit
Posts: 4,621
Quote:
Originally Posted by ghostdog74
Code:
awk '
        $0 !~ /^\$/ { printf $0" " }
        $0 ~ /^\$/ { print;next}
' file
That prints a double space when it encounters a blank line.
  #6  
Old 3rd August 2007, 02:46 AM
ghostdog74 Offline
Registered User
 
Join Date: Sep 2006
Posts: 52
Quote:
Originally Posted by RupertPupkin
That prints a double space when it encounters a blank line.
Code:
awk '
        $0 !~ /^\$/ && $0 !~ /^$/ { printf $0" " }
        $0 ~ /^\$/ { print;next}

' file
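A possible variant, for input where the amount does not always sit on its own line: the patterns above treat a line as the end of a record only when "$" is its first character, so a line such as "Jun. 25, 2007 THE HOME DEPOT 7026 $37.56" is printed with a trailing space and no newline. The sketch below is not the script posted above. It assumes a record ends at any line containing "$", skips blank lines, and builds the record in a variable instead of using the data as a printf format (so a "%" in the data is harmless). The function name join_to_dollar is made up for illustration.

```shell
#!/bin/bash
# Sketch only: join_to_dollar is a hypothetical helper name, not from the thread.
# Reads text on stdin and prints one joined record for each line containing "$".
join_to_dollar() {
    awk '
        /^[ \t]*$/ { next }                          # skip blank lines entirely
        { rec = (rec == "" ? $0 : rec " " $0) }      # append with one separating space
        /\$/ { print rec; rec = "" }                 # "$" anywhere on the line ends the record
        END  { if (rec != "") print rec }            # flush a trailing record without "$"
    '
}
```

It would be used as `join_to_dollar < input > output`, assuming the input file is called 'input' as in the earlier posts.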
  #7  
Old 4th August 2007, 03:59 AM
RupertPupkin Offline
Registered User
 
Join Date: Nov 2006
Location: Detroit
Posts: 4,621
OK, I am going to have to take a new look at awk, or at least the GNU version (gawk).

I just ran the scripts by UnFleshedOne, ghostdog74, and me on a "large" file. What I did was take powah's original data and concatenate it 10,000 times onto itself, producing a 1.3MB input file. I was curious to see how long each script would take. Here are the results:

UnFleshedOne: 5 minutes 33 seconds
me: 11.1 seconds
ghostdog74: 0.36 seconds
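
For anyone wanting to repeat the timing, a test file of that kind could be built roughly as below. This is only a sketch, not the command actually used for the numbers above; make_big_input and its three arguments (source file, number of copies, destination file) are invented names.

```shell
#!/bin/bash
# Sketch only: make_big_input is a hypothetical helper, not the poster's command.
# Usage: make_big_input SOURCE COPIES DEST
make_big_input() {
    awk -v n="$2" '
        { buf = buf $0 "\n" }                          # read the small source file once
        END { for (i = 0; i < n; i++) printf "%s", buf }   # write it out n times
    ' "$1" > "$3"
}
```

For example, `make_big_input sample 10000 input` followed by `time ./script` for each candidate, where 'sample' is assumed to hold the original seven lines of data.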

I'm not surprised by the results for UnFleshedOne's script, since he calls an external program (cat) to read the entire input file's contents into memory (ouch!) and then re-reads each line of that file's contents into a giant list passed to the for loop, i.e. into memory again (double ouch!). I suspected that my script would be decent but not great, which turned out about right. The real shocker to me was ghostdog74's awk script, doing the whole thing in less than half a second!

I am very surprised by that, because I stopped using awk about 10 years ago precisely because its performance with large files was terrible. I guess things have really changed, at least for gawk. It's super fast now. Out of curiosity, I increased the size of the input file to 130MB. Sure enough, the awk script took only 33.7 seconds to complete! In comparison, UnFleshedOne's script bombed after 3 minutes 10 seconds with a "can not allocate memory" error before it could even write the output, and my script took 18 minutes 21 seconds.

So I have ghostdog74 to thank for making me see the awk light. Luckily I still have my old copy of The Awk Programming Language lying around.
  #8  
Old 4th August 2007, 05:58 AM
UnFleshed One Offline
Registered User
 
Join Date: Jul 2007
Location: Calgary AB
Posts: 34
Haha, well, to create a script that can be killed by a mere 100 megs of data is an achievement in itself. And come to think of it, I use these kinds of loops over arrays all over the place (the files are really small though).

Awk is a true killer indeed.
__________________
Because it is lonely on the top of the food chain.
  #9  
Old 10th August 2007, 06:03 AM
pigpen Offline
Registered User
 
Join Date: Nov 2003
Location: Regensburg, Germany
Age: 42
Posts: 447
Anyone mentioning IFS=... is my hero! Cost me a trillion hours to find this one out!
__________________
/(bb|[^b]{2})/ -- that is the question!

Last edited by pigpen; 10th August 2007 at 06:05 AM. Reason: typo
  #10  
Old 10th August 2007, 07:53 AM
mndar Online
Registered User
 
Join Date: Feb 2005
Posts: 1,101
Here's a oneliner
Code:
cat input|tr "\n" " "| sed '/Jan\|Feb\|Mar\|Apr\|May\|Jun\|Jul\|Aug\|Sept\|Oct\|Nov\|Dec/ s//\
&/g'|sed 's/  */ /g'
Not sure if it will work with all your data, but it certainly does for the sample you posted.
I've assumed that every record begins with a month abbreviation. You might have to adjust the abbreviations and hope that they don't match some other text in your input data.
PS: Remove the spaces between Oct\ and |Nov\. The [CODE] tag seems to be adding them. So, I've attached the script.
Attached Files
File Type: txt test.sh.txt (127 Bytes, 98 views)

Last edited by mndar; 10th August 2007 at 10:19 AM.
  #11  
Old 13th August 2007, 09:46 AM
ghostdog74 Offline
Registered User
 
Join Date: Sep 2006
Posts: 52
Quote:
Originally Posted by mndar
Here's a oneliner
Code:
cat input|tr "\n" " "| sed '/Jan\|Feb\|Mar\|Apr\|May\|Jun\|Jul\|Aug\|Sept\|Oct\|Nov\|Dec/ s//\
&/g'|sed 's/  */ /g'
you don't have to use cat for that..
Code:
tr "\n" " " < input  |  sed .........
  #12  
Old 13th August 2007, 01:33 PM
sideways Offline
Retired User
 
Join Date: Oct 2004
Location: London, UK
Posts: 4,999
A little C code compiled with gcc on F7 is 10 times faster than that awk script.

Code:
#include <stdio.h>
#include <string.h>

int main()
{
  char buf[256], joined_str[1024];

  *joined_str = 0;

  while (fgets(buf, 256, stdin) != NULL)
  {
    *strchr(buf, '\n') = ' ';
    strcat(joined_str, buf);
    if (strchr(joined_str, '$') != NULL)
    {
      puts(joined_str);
      *joined_str = 0;
    }
  }

 return 0;
}
Code:
gcc -o join_lines join_lines.c
./join_lines < input_file > output_file
(The code is a shoddy snippet and for test purposes only)
  #13  
Old 13th August 2007, 05:30 PM
sideways Offline
Retired User
 
Join Date: Oct 2004
Location: London, UK
Posts: 4,999
This is probably uber-nerdy, but I was sufficiently interested by RupertPupkin's results to investigate myself. I also shied away from awk for large processing tasks, thinking it was not up to it. Now, while these results show that a minimal C program can parse the 125MB file in ~3 seconds (!), awk's sub-minute showing is very reasonable, and considering that you would process files of this size quite infrequently it's definitely worth considering before creating C code. I ran the tests on an AMD64 3500+ with 1 GB of memory. I amended the C code to include basic buffer overflow checks, but (unlike the other three candidates) it will produce incorrect results for unusually long lines in the input file.

This was the result for a 125MB file created by concatenating the original repeatedly (which is not an ideal test file, and may cause abnormal behaviour for the benchmarks)

C code = ~ 3 secs
CPP code = ~ 29 secs
awk script (ghostdog74) = ~53 secs
shell script (RupertPupkin) = ~15 minutes

So, basic C, with minimal error checking clearly rules, but awk's performance is very impressive considering it is a script.

[jb@linux7 test]$ time ./myparse_c < input > outfile

real 0m3.178s
user 0m2.544s
sys 0m0.533s


[jb@linux7 test]$ time ./myparse_cpp < input > outfile

real 0m29.708s
user 0m19.296s
sys 0m9.989s


[jb@linux7 test]$ time awk '$0 !~ /^\$/ && $0 !~ /^$/ { printf $0" " } $0 ~ /^\$/ { print; printf "\n"; next} ' input > outfile

real 0m53.793s
user 0m52.839s
sys 0m0.651s


[jb@linux7 test]$ time while read line; do joined_line="${joined_line% } ${line}"; if [[ ${line} =~ .*[\$]+.* ]]; then echo "${joined_line# }"; joined_line=""; fi; done < input > outfile

real 14m55.591s
user 14m12.884s
sys 0m36.456s


Code:
/****   myparse.c    ****/

#include <stdio.h>
#include <string.h>

#define MAX_LENGTH 1024

int main()
{
  char buf[256], joined_str[MAX_LENGTH];

  *joined_str = 0;

  while (fgets(buf, 256, stdin) != NULL)
  {
    char* p = strchr(buf, '\n');
    if ( p != NULL ) *p = ' ';
    
    int i = strlen(joined_str) + strlen(buf); 
    if (i < MAX_LENGTH) strcat(joined_str, buf);

    if (strchr(buf, '$') != NULL)
    { 
      puts(joined_str);
      *joined_str = 0;
    }
  }

 return 0;
}

Code:
/****   myparse.cpp   ****/

#include <iostream>
#include <string>

using namespace std;

int main()
{
  string buf, joined_str("");

  while (getline(cin, buf, '\n'))
  {
    joined_str += buf + " "; 
    if (buf.find('$') != string::npos)
    { 
      cout << joined_str << "\n";
      joined_str = "";
    }
  }

 return 0;
}
  #14  
Old 14th August 2007, 02:41 AM
RupertPupkin Offline
Registered User
 
Join Date: Nov 2006
Location: Detroit
Posts: 4,621
Quote:
Originally Posted by sideways
I amended the C code to include basic buffer overflow checks, but (unlike the other three candidates) it will produce incorrect results for unusually long lines in the input file.
Well, that is kind of important, don't you think? You're not comparing apples to apples. You need to modify your C code to be able to handle not just the case where there can be more than 256 characters in a line, but also the case where there are more than 1024 characters between dollar sign symbols. Neither the awk nor bash scripts have those limitations, as you noted. In fact, you also need to modify your code to not replace a blank line by a space, which it's doing right now. Those 3 changes probably won't slow down your C program by very much, but it would at least make for a fairer comparison.

By the way, for some reason you added an extra printf statement to the awk script, which may explain why on your faster machine that script ran about 20 seconds slower than on my much slower machine.

I concatenated the original data onto itself 1 million times to create a 125MB (130,000,000 bytes) file. Here's what I got:

Code:
C code:
time myparse < input > output

real    0m5.981s
user    0m4.422s
sys     0m1.290s

awk script:
time ./joinlines.awk input > output

real    0m32.960s
user    0m31.225s
sys     0m1.322s
So the C program is about 5-6 times faster than the awk script without the necessary changes. I'll be curious to see how much that changes after the C code is modified.
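
One way to check that two candidates are really doing the same job before comparing their times is to confirm that they produce byte-identical output on the same input. The sketch below is only an illustration of that idea; outputs_match is an invented helper name, and it assumes each candidate can be written as a command that reads the input on stdin.

```shell
#!/bin/bash
# Sketch only: outputs_match is a hypothetical helper, not something from the thread.
# Usage: outputs_match INPUT 'command one' 'command two'
# Returns 0 only when both commands write identical output for INPUT.
outputs_match() {
    sum_a=$(sh -c "$2" < "$1" | cksum)    # checksum and byte count of the first output
    sum_b=$(sh -c "$3" < "$1" | cksum)    # same for the second command
    [ "$sum_a" = "$sum_b" ]
}
```

For example, `outputs_match input './myparse' './joinlines.awk' && echo same`, using the program names from the timings above. If the outputs differ (extra blank lines, stray spaces), the timings are not measuring the same work.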
  #15  
Old 14th August 2007, 07:25 AM
ghostdog74 Offline
Registered User
 
Join Date: Sep 2006
Posts: 52
how about this:
get the awk source code, extract the part where it does what the C code does, compile the code to become a "mini" awk which only performs the same function as the C code. Then we can compare apples to apples.
how does that sound?
Tags
file, join, lines, text
