Fedora Linux Support Community & Resources Center

  #1  
Old 28th May 2012, 02:17 PM
papori Offline
Registered User
 
Join Date: Nov 2010
Posts: 58
How can i make file with unique lines?

Hi all,
I know I can use uniq to remove duplicate lines from a sorted file.

But I have two problems with this approach:
First, the file must be sorted, and that takes a long time (I have a 150 GB text file).
Second, I want to remove the duplicate lines and also report how many times each duplicate appeared.

Is there a smart way to do this without using:
1) sort myfile.txt > mysortedfile.txt
2) uniq -c mysortedfile.txt
3) uniq mysortedfile.txt

?

Thanks.

Pap
  #2  
Old 28th May 2012, 02:40 PM
DBelton Offline
Administrator
 
Join Date: Aug 2009
Posts: 6,620
Re: How can i make file with unique lines?

You could try using awk instead. I haven't tried it on a file that large, but it works great on the files I have tried it with, and there is no need to sort the file first.

Code:
awk ' !x[$0]++' original.file > new.file
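
For readers wondering how this one-liner works: `x` is an associative array keyed by the whole line (`$0`). The expression `x[$0]++` yields the number of times the line has been seen so far and then increments it, so `!x[$0]++` is true only the first time a line appears, and awk's default action prints that line. Below is a longer but equivalent form run on a tiny made-up sample; the file names `sample.txt` and `unique.txt` are placeholders, not part of the original post.

```shell
# Build a tiny sample file; the contents are made up for illustration.
workdir=$(mktemp -d)
printf 'b\na\nb\nc\na\nb\n' > "$workdir/sample.txt"

# Same logic as  awk '!x[$0]++'  written out in full.
awk '{
    if (seen[$0] == 0) {   # first time this exact line appears
        print $0           # keep it, preserving the original order
    }
    seen[$0]++             # remember that the line has been seen
}' "$workdir/sample.txt" > "$workdir/unique.txt"

cat "$workdir/unique.txt"
```

The first occurrence of each line is kept in input order. Note that the array holds every distinct line, so memory use grows with the number of unique lines in the file.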
  #3  
Old 28th May 2012, 03:02 PM
papori Offline
Registered User
 
Join Date: Nov 2010
Posts: 58
Re: How can i make file with unique lines?

Hi DBelton,
Thanks for the fast response.

Can you explain the command?

My input was:
@HISEQ1:1360NU8ACXX:5:1101:4153:1981 1:N:0:ACTTGA
ATTGCAATACTAACTGTGACTCAACTGAGGAGTTAAGATGTANNNNNNNN
@HISEQ1:1360NU8ACXX:5:1101:4153:1981 1:N:0:ACTTGA
ATTGCAATACTAACTGTGACTCAACTGAGGAGTTAAGATGTANNNNNNNN
@HISEQ1:1360NU8ACXX:5:1101:4153:1981 1:N:0:ACTTGA
ATTGCAATACTAACTGTGACTCAACTGAGGAGTTAAGATGTANNNNNNNN
@HISEQ1:1360NU8ACXX:5:1101:4153:1981 1:N:0:ACTTGA
ATTGCAATACTAACTGTGACTCAACTGAGGAGTTAAGATGTANNNNNNNN


The output:
@HISEQ1:1360NU8ACXX:5:1101:4153:1981 1:N:0:ACTTGA
ATTGCAATACTAACTGTGACTCAACTGAGGAGTTAAGATGTANNNNNNNN


The output is clearly unique.
But I still want to know how many times each duplicate appeared.

Thanks
Pap
  #4  
Old 28th May 2012, 03:21 PM
DBelton Offline
Administrator
 
Join Date: Aug 2009
Posts: 6,620
Re: How can i make file with unique lines?

Well, you could run something like the first command below before the earlier command that deletes the duplicate lines:

Code:
awk '{a[$0]++}END{for(i in a){if(a[i]-1)print i,a[i]}}' original.file
awk ' !x[$0]++' original.file > new.file
The first command counts every line that occurs more than once and prints it followed by the number of times it occurred.
  #5  
Old 28th May 2012, 03:43 PM
papori Offline
Registered User
 
Join Date: Nov 2010
Posts: 58
Re: How can i make file with unique lines?

Instead of using two commands, can I merge them?
  #6  
Old 28th May 2012, 03:54 PM
DBelton Offline
Administrator
 
Join Date: Aug 2009
Posts: 6,620
Re: How can i make file with unique lines?

The easy way?

Just create a script file with the two commands in it and run the script when you need it.

Simplest form:
Code:
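#!/bin/bash
# Usage: <script> infile outfile
infile=$1
outfile=$2
# Print each duplicated line followed by the number of times it occurs.
awk '{a[$0]++}END{for(i in a){if(a[i]-1)print i,a[i]}}' "$infile"
# Write the first occurrence of every line to the output file.
awk '!x[$0]++' "$infile" > "$outfile"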
Save it as a script, make it executable, and run it with the two filenames as parameters.
  #7  
Old 28th May 2012, 04:19 PM
papori Offline
Registered User
 
Join Date: Nov 2010
Posts: 58
Re: How can i make file with unique lines?

Thanks, but what I meant is that I want to run those commands in parallel (simultaneously, to save time).
I don't want to go over the file twice (it's a huge file, and it takes a long time).
  #8  
Old 28th May 2012, 07:46 PM
weitjong Online
Registered User
 
Join Date: Oct 2006
Location: Singapore, 新加坡
Posts: 789
Re: How can i make file with unique lines?

I have tried this on a small set of input and it seems to work. I am not sure whether it will scale to a 150 GB data file.

Code:
perl -ne '$seq{$_}++; END { open COUNT, ">count.dat"; open UNIQUE, ">unique.dat"; for (sort keys %seq) { printf COUNT "%7u %s", $seq{$_}, $_; print UNIQUE $_ } }' input.dat
This one-liner produces two output files, count.dat and unique.dat. If you don't need the output in sorted order, you can remove the "sort" call in the code above to speed things up a little.
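
The same single-pass idea can also be sketched in awk, which the earlier replies used. This is only a sketch and has not been tried at anything like 150 GB: the file names `input.dat`, `unique.dat`, and `count.dat` follow the Perl example, the awk variable names are invented here, and the sample data is made up. Unique lines are written as soon as they are first seen, and the counts are written once the whole input has been read.

```shell
# Made-up sample input for illustration.
workdir=$(mktemp -d)
printf 'x\ny\nx\nz\nx\ny\n' > "$workdir/input.dat"

awk -v unique_file="$workdir/unique.dat" -v count_file="$workdir/count.dat" '
    # First occurrence of a line: write it out immediately.
    !count[$0]++ { print > unique_file }
    END {
        # Iteration order over an awk array is unspecified.
        for (line in count)
            printf "%7d %s\n", count[line], line > count_file
    }
' "$workdir/input.dat"

cat "$workdir/unique.dat"
cat "$workdir/count.dat"
```

As with the Perl version, every distinct line is kept in memory until the end, so this only helps when the set of unique lines fits in RAM.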
__________________
YaoWT - Leave no window unbroken ^_^
  #9  
Old 29th May 2012, 06:33 AM
papori Offline
Registered User
 
Join Date: Nov 2010
Posts: 58
Re: How can i make file with unique lines?

Thanks weitjong!
I don't need the output to be sorted. So I just take out the sort, like this:
perl -ne '$seq{$_}++; END { open COUNT, ">count.dat"; open UNIQUE, ">unique.dat"; for (keys %seq) { printf COUNT "%7u %s", $seq{$_}, $_; print UNIQUE $_ } }' input.dat
?

Should it be faster than "sort" followed by uniq?

Thanks
Pap
  #10  
Old 29th May 2012, 11:29 AM
weitjong Online
Registered User
 
Join Date: Oct 2006
Location: Singapore, 新加坡
Posts: 789
Sure, if you don't need the output to be sorted, then removing the sort call will speed things up. The improvement from my script is twofold:
1. No sorting is needed unless you want it. Unlike the uniq command, a Perl hash does not need the input lines to be sorted.
2. It reads the input file once and generates both output files in one go.

It should cut the processing time by at least half compared with your original proposal of one sort and two uniq commands. That said, if performance is truly your concern and you need to perform this operation frequently, you may as well invest the time to write a C program for raw speed.
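
For comparison, the sort-based baseline can itself be reduced to a single sort plus one uniq, because `uniq -c` already prints each distinct line once, prefixed by its count. The sketch below uses made-up sample data and placeholder file names, and `LC_ALL=C` is an optional addition that makes sort compare raw bytes. It still pays the full cost of sorting the input.

```shell
# Made-up sample input for illustration.
workdir=$(mktemp -d)
printf 'x\ny\nx\nz\nx\ny\n' > "$workdir/input.dat"

# uniq -c prints each distinct line once, prefixed by its count,
# so a separate plain uniq pass over the sorted data is not needed.
LC_ALL=C sort "$workdir/input.dat" | uniq -c > "$workdir/count.dat"

# Strip the count column to recover just the unique lines.
sed 's/^ *[0-9]* //' "$workdir/count.dat" > "$workdir/unique.dat"

cat "$workdir/count.dat"
```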


  #11  
Old 29th May 2012, 01:27 PM
stevea Offline
Registered User
 
Join Date: Apr 2006
Location: Ohio, USA
Posts: 8,346
Re: How can i make file with unique lines?

Quote:
Originally Posted by papori View Post
my input was:
@HISEQ1:1360NU8ACXX:5:1101:4153:1981 1:N:0:ACTTGA
ATTGCAATACTAACTGTGACTCAACTGAGGAGTTAAGATGTANNNNNNNN
@HISEQ1:1360NU8ACXX:5:1101:4153:1981 1:N:0:ACTTGA
ATTGCAATACTAACTGTGACTCAACTGAGGAGTTAAGATGTANNNNNNNN
@HISEQ1:1360NU8ACXX:5:1101:4153:1981 1:N:0:ACTTGA
ATTGCAATACTAACTGTGACTCAACTGAGGAGTTAAGATGTANNNNNNNN
@HISEQ1:1360NU8ACXX:5:1101:4153:1981 1:N:0:ACTTGA
ATTGCAATACTAACTGTGACTCAACTGAGGAGTTAAGATGTANNNNNNNN

Don't you need to keep the "@HISEQ1:..." lines associated with the immediately following RNA sequence? None of these algorithms will do that.

When you say 'unique', do you mean that the HISEQ1: line AND the RNA sequence must match, or just the RNA?
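
If the header line does need to stay with its sequence, one possible approach is to treat each header/sequence pair as a single record. The sketch below assumes strictly two-line records, as in the quoted sample (real FASTQ files normally have four lines per record, so this would need adapting), and it treats two records as duplicates only when both lines match. The file names, awk variable names, and sample reads are made up for illustration.

```shell
# Made-up two-line records: a header line followed by a sequence line.
workdir=$(mktemp -d)
printf '@read1\nACGT\n@read1\nACGT\n@read2\nACGT\n@read1\nACGT\n' > "$workdir/reads.txt"

awk -v count_file="$workdir/count.txt" '
    NR % 2 == 1 { header = $0; next }      # odd lines: remember the header
    {
        key = header "\n" $0               # even lines: header plus sequence
        if (!count[key]++) print key       # first time this pair is seen
    }
    END {
        for (key in count) {
            split(key, part, "\n")
            printf "%d\t%s\t%s\n", count[key], part[1], part[2] > count_file
        }
    }
' "$workdir/reads.txt" > "$workdir/unique.txt"

cat "$workdir/unique.txt"
```

To match on the sequence alone, the key could be just the sequence line, with the first header seen for it stored separately.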

---------- Post added at 08:26 AM ---------- Previous post was at 08:24 AM ----------

Doesn't ABySS do this and more?

---------- Post added at 08:27 AM ---------- Previous post was at 08:26 AM ----------

http://www.bcgsc.ca/downloads/abyss/abyss-1.3.3.tar.gz
__________________
None are more hopelessly enslaved than those who falsely believe they are free.
Johann Wolfgang von Goethe