Fedora Linux Support Community & Resources Center
  #1  
Old 12th March 2017, 10:15 PM
lsatenstein Offline
Registered User
 
Join Date: Jun 2005
Location: Montreal, Que, Canada
Posts: 3,807
linuxfedorafirefox
utilities optimisation

As a retirement pasttime, I write code for security. One of my functions uses the sha1 algorithm as a lookup key

This algorithm, and from what I have viewed, (md5, sha2, sha256, sha512) are programmed using 32 bit integers.

Has anyone seen code where the 64 bit integers are used?

One other item I noticed is that the Linux hashes show results in hex.
eg

In lieu of displaying hex, one could use a base 64 output against the following
"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLM NOPQRSTUVWXYZ=$"

I took the print function from the sha1 program and modifed it as shown below.
Code:
#if 0         // Original base hex output
static char *sha1_print(SHA1_CONTEXT *ctx)
{
    static char bfrhash[41];
  
    sprintf (bfrhash,"%8.08x%8.08x%8.08x%8.08x%8.08x",
		    (unsigned)ctx->h0,(unsigned)ctx->h1,
                    (unsigned)ctx->h2,(unsigned)ctx->h3,
                    (unsigned)ctx->h4);
    return (bfrhash);
}

#else     // Revised, base64 output

static char *sha1_print(SHA1_CONTEXT *ctx)
{
    static char bfrhash[41];
    char *cp;
    unsigned  tmp;
    int i,j;
    cp=bfrhash;
    for(j=0;j<4;j++)
    {
      switch( j )
      {
        case 0:
	  tmp=(unsigned)ctx->h0;
	  //tmp=0xffffffff;           //was used to determine max i  (6 loops)         
	  break;
        case 1:
	  tmp=(unsigned)ctx->h1;
	  break;
        case 2:
	  tmp=(unsigned)ctx->h2;
	  break;
        case 3:
	  tmp=(unsigned)ctx->h3;
	  break;
        case 4:
	  tmp=(unsigned)ctx->h4;
	  break;
      }
      for(i=0;i<6;i++)
      {
         *cp++="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ=$"[tmp%64];
         tmp/=64;
      }
    }
    *cp=0;
    return (bfrhash);
}

#endif

Here is the original and the replacement output
ORIGIONAL

Code:
7d154332c533b593191d77e90a32f47241e1679a  makefile
536a25aa3cea335f4ee0d4716f88b3bac5c27ed5  sha1
c9f135f87fc4d541cc53cdc0262bc84ebc5b2c5d  sha1.c
74f8b7e8110933b1520008c037d9ce71ef168de1  sha.c
81cf224a23b6b2d67068240902d39be506ae0b70  sha.h
31cd553308219d7983655c41a549f27dc5cd7aaf  sha.o
c1c6d30875ae241891227eea05816376c000808c  shatest.c
c544db1088374779499ff0109bff765520a3ec76  zzLastBkup
BASE 64 OUTPUT
Code:
Ock5Z1jmXc53Fvn7p0OhLca0  makefile
Gmyqj1vdzWY0NhdUe1WebyL1  sha1
UnjY931ldN$10TYkc3exYaC0  sha1.c
Evb=Q1Nej2h00z00i1NVsST0  sha.c
a9OP12mbHJz09g2qM1BLVQ20  sha.h
PklPN0VRp8801Nlp32Z9viB2  sha.o
8cJN13ogyHR1GXD8h2Sdmw50  shatest.c
gIdh53VtQd82g0$D91lpT$r2  zzLastBkup
Hash string length representation went from 40 characters to 25, with a mapping of 1 to 1. (no loss of data)
For shaxxxsums the ratio is 60% string length reduction.

The string for sha256sum can be reduced from 64 to 40. chars
The string for sha512syn can be reduced from 128 to 80 chars
__________________
Leslie in Montreal

Interesting web sites list
http://forums.fedoraforum.org/showth...40#post1697840

Last edited by lsatenstein; 15th March 2017 at 12:51 AM.
Reply With Quote
  #2  
Old 13th March 2017, 01:05 AM
ocratato Online
Registered User
 
Join Date: Oct 2010
Location: Canberra
Posts: 2,476
linuxfirefox
Re: utilities optimisation

A few points if I may:

1/ There seems to be a consensus that it is time to move away from sha1.

2/ A 64 bit version of the algorithm may be an issue for some low powered devices - such as "Internet of Things" stuff which still need security. Having two versions to maintain could also be an issue.

3/ Besides being a little shorter, what advantage is there for base64. Having two different display formats could create confusion.
__________________
Has anyone seriously considered that it might be turtles all the way down?
That's very old fashioned thinking.
The current model is that it's holographic nested virtualities of turtles, all the way down.
Reply With Quote
  #3  
Old 15th March 2017, 01:06 AM
lsatenstein Offline
Registered User
 
Join Date: Jun 2005
Location: Montreal, Que, Canada
Posts: 3,807
linuxfedorafirefox
Re: utilities optimisation

Hi Ocratato,

What I did was to look at shrinking information without losing information,

The idea that there is a 5/8 reduction in hash string length is not difficult to accommodate.

One can rename the functions from sha1sum to xxx25sum, sha256sum to xxx40sum sha512sum to xxx80sum.

I run a list of every file on my computer, I then sort by the strings and look for duplicate hashes. I prefer to compare 40 character strings in lieu of 64.

The reason it works, is the shaxxx calculations are based on multiple 32 bit unsigned integers.

What name to suggest for xxx?
__________________
Leslie in Montreal

Interesting web sites list
http://forums.fedoraforum.org/showth...40#post1697840
Reply With Quote
  #4  
Old 15th March 2017, 01:30 AM
bobx001 Offline
Registered User
 
Join Date: Dec 2012
Location: santa barbara, CA
Posts: 338
linuxfedorafirefox
Re: utilities optimisation

lol, your shabby64sum is not bad at all.

a while ago I ended up converting a company from java + oracle cr4p to c + (my shm stuff) + php + postgres , and we ran into google suddenly able to spider all our content at lightning speed, which led to caching, which led to the following problem:

our sheeple in the "product categorization department" would switch products around, one day this headphone was in computers, the next day it was in stereos, etc etc etc..

at the end of my work there we had 800,000 URLs redirected at ultra bobby-beaver speed, by having the URLs converted to crc32 and putting that into shared memory with my indexing tech.

Interestingly, we had a few duplicate crc32 values which gave us a different URL, so I had to switch to non-unique index, and implement a walk subroutine to search for the proper match.

I remember at that time I searched for a way to implement crc64, could not find the solution anyway, so I eventually ended up programming my own weirdness, sort of an md5sum, but mixing base64 chars too.

I will try to find the code to see if we can mix and match. At home I back up all my stuff into large internal drives on docking stations, (on my 15th so far, latest one is 6TB), and I built a postgres database thingy with php, that indexes all files according to their "first 5MB of md5sum" (so, if I have a 4gb video, the system does not spend an hour calculating the md5sum of the whole file). And it creates "delete orders", so if the data is duplicate, and names match, I always delete from the oldest drive first.

my 0.02$
Reply With Quote
  #5  
Old 15th March 2017, 03:09 AM
lsatenstein Offline
Registered User
 
Join Date: Jun 2005
Location: Montreal, Que, Canada
Posts: 3,807
linuxfedorafirefox
Re: utilities optimisation

Quote:
Originally Posted by bobx001 View Post
lol, your shabby64sum is not bad at all.

a while ago I ended up converting a company from java + oracle cr4p to c + (my shm stuff) + php + postgres , and we ran into google suddenly able to spider all our content at lightning speed, which led to caching, which led to the following problem:

our sheeple in the "product categorization department" would switch products around, one day this headphone was in computers, the next day it was in stereos, etc etc etc..

at the end of my work there we had 800,000 URLs redirected at ultra bobby-beaver speed, by having the URLs converted to crc32 and putting that into shared memory with my indexing tech.

Interestingly, we had a few duplicate crc32 values which gave us a different URL, so I had to switch to non-unique index, and implement a walk subroutine to search for the proper match.

I remember at that time I searched for a way to implement crc64, could not find the solution anyway, so I eventually ended up programming my own weirdness, sort of an md5sum, but mixing base64 chars too.

I will try to find the code to see if we can mix and match. At home I back up all my stuff into large internal drives on docking stations, (on my 15th so far, latest one is 6TB), and I built a postgres database thingy with php, that indexes all files according to their "first 5MB of md5sum" (so, if I have a 4gb video, the system does not spend an hour calculating the md5sum of the whole file). And it creates "delete orders", so if the data is duplicate, and names match, I always delete from the oldest drive first.

my 0.02$
Hi Bob,
Most interested in sharing stuff that you did with stuff that I did. We can do it off line by starting with the fedoraforum message facility.
__________________
Leslie in Montreal

Interesting web sites list
http://forums.fedoraforum.org/showth...40#post1697840
Reply With Quote
  #6  
Old 15th March 2017, 08:09 AM
bobx001 Offline
Registered User
 
Join Date: Dec 2012
Location: santa barbara, CA
Posts: 338
linuxfedorafirefox
Re: utilities optimisation

Quote:
Originally Posted by lsatenstein View Post
Hi Bob,
Most interested in sharing stuff that you did with stuff that I did. We can do it off line by starting with the fedoraforum message facility.
Absolutely, and I see that we are both old K&R strict indenters, a refreshing sight, most excellent !!!!
BTW, for C, PHP and HTML, I recommend Kate as an editor (say in Vim dark theme first, then you can build your own), the "Mini-map" is a godsend!. But of course you need to turn off indentation on it, cuz their indenting rules are totally off.
Reply With Quote
  #7  
Old 16th March 2017, 12:17 AM
lsatenstein Offline
Registered User
 
Join Date: Jun 2005
Location: Montreal, Que, Canada
Posts: 3,807
linuxchrome
Re: utilities optimisation

Quote:
Originally Posted by bobx001 View Post
Absolutely, and I see that we are both old K&R strict indenters, a refreshing sight, most excellent !!!!
BTW, for C, PHP and HTML, I recommend Kate as an editor (say in Vim dark theme first, then you can build your own), the "Mini-map" is a godsend!. But of course you need to turn off indentation on it, cuz their indenting rules are totally off.

Hi Bob, For coding I much prefer qt-creator. It has a built-in code reformatter. (CTL A CTL I )

And it has redeeming features.

Give it a try. I had used Kate, and feel that creator is more to my liking.
__________________
Leslie in Montreal

Interesting web sites list
http://forums.fedoraforum.org/showth...40#post1697840
Reply With Quote
  #8  
Old 16th March 2017, 09:01 AM
bobx001 Offline
Registered User
 
Join Date: Dec 2012
Location: santa barbara, CA
Posts: 338
linuxfedorafirefox
Re: utilities optimisation

Quote:
Originally Posted by lsatenstein View Post
Hi Bob, For coding I much prefer qt-creator. It has a built-in code reformatter. (CTL A CTL I )

And it has redeeming features.

Give it a try. I had used Kate, and feel that creator is more to my liking.

Giving it a try right now. Here is a side-by-side (the colors on the kate for PHP are my own selection):


Q: does it have a Mini-map like kate ?

for programs with thousands of lines, like the one shown, there, it's an imperative option. You just grab the little box and slide it to where you know you need to edit, and it even shows "changed lines" in a different color. VERY useful. I tried "sublime", but it defeats it's own purpose cuz it only shows a very small subset of the total amount of lines. Kate shows the complete program there, which is the way it should be.

Last edited by bobx001; 16th March 2017 at 09:11 AM.
Reply With Quote
  #9  
Old 16th March 2017, 01:46 PM
ocratato Online
Registered User
 
Join Date: Oct 2010
Location: Canberra
Posts: 2,476
linuxfirefox
Re: utilities optimisation

Don't forget that often the best optimisation is to consider a better algorithm.

For example, Leslie is using a sha1 cryptographic hash to identify duplicate files. This application doesn't need the crypto - a much simpler hash algorithm would perform the same task and do it much quicker.
__________________
Has anyone seriously considered that it might be turtles all the way down?
That's very old fashioned thinking.
The current model is that it's holographic nested virtualities of turtles, all the way down.
Reply With Quote
  #10  
Old 16th March 2017, 03:36 PM
bobx001 Offline
Registered User
 
Join Date: Dec 2012
Location: santa barbara, CA
Posts: 338
linuxfedorafirefox
Re: utilities optimisation

Quote:
Originally Posted by ocratato View Post
a much simpler hash algorithm would perform the same task and do it much quicker.
Bingo !
Reply With Quote
  #11  
Old 16th March 2017, 04:34 PM
DBelton Online
Administrator
 
Join Date: Aug 2009
Posts: 8,414
linuxfedorafirefox
Re: utilities optimisation

Quote:
Originally Posted by ocratato View Post
Don't forget that often the best optimisation is to consider a better algorithm.

For example, Leslie is using a sha1 cryptographic hash to identify duplicate files. This application doesn't need the crypto - a much simpler hash algorithm would perform the same task and do it much quicker.
That is pretty much all I would use sha1 for nowadays. It still works great for finding duplicate or corrupted files, but isn't really secure anymore.

Still, sha2 would probably be faster than sha1, but the difference would be very small. Probably not enough to even notice.

But, what I would do would be to look at file sizes first, and only do more in depth checking for files that are the same size. I would probably choose something like md5 for an algorithm, but everybody chooses differently Actually, I would probably check files sizes, and on files of the same size, I would probably then compare the first hundred bytes or so, then calculate a hash and compare. Simple compares are faster than hashing, and could rule out quite a few files as being duplicates.

Edit:

Oh, my preference for an editor is geany. I have been using it for several years now and love it.
Reply With Quote
  #12  
Old 16th March 2017, 07:58 PM
lsatenstein Offline
Registered User
 
Join Date: Jun 2005
Location: Montreal, Que, Canada
Posts: 3,807
linuxchrome
Smile Re: utilities optimisation

Everyone​ is right. In fact, I use a simple hashing algorithm that I found on Dr. Dobbs magazine from the 90's. In other cases I use md5. DB. I do a hash of file contents concatenated with the file size. It should be no surprise tha all zero length files have the same hash value.

My algorithm sorts by hash, followed by pathname. I have many thousands of files in my Fedora 25 system.
Where I do d duplicates most often is with my music library.
__________________
Leslie in Montreal

Interesting web sites list
http://forums.fedoraforum.org/showth...40#post1697840
Reply With Quote
Reply

Tags
optimisation, utilities

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Image optimisation drewsmith Using Fedora 1 11th April 2008 09:12 PM
rpmbuild as user: optimisation flag question a550ee Using Fedora 1 24th August 2006 01:31 PM
amd optimisation tanis Hardware & Laptops 5 11th June 2005 03:36 PM
Imaging utilities for FC3 nhusted Using Fedora 4 10th June 2005 05:17 PM


Current GMT-time: 02:01 (Friday, 24-03-2017)

TopSubscribe to XML RSS for all Threads in all ForumsFedoraForumDotOrg Archive
logo

All trademarks, and forum posts in this site are property of their respective owner(s).
FedoraForum.org is privately owned and is not directly sponsored by the Fedora Project or Red Hat, Inc.

Privacy Policy | Term of Use | Posting Guidelines | Archive | Contact Us | Founding Members

Powered by vBulletin® Copyright ©2000 - 2012, vBulletin Solutions, Inc.

FedoraForum is Powered by RedHat