HUGE word-lists duplicate remover and merge tool


blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2787
Team: HashKiller
Reputation: 3701 Reputation
Offline
Mon, 31 Mar 2014 @ 23:05:02

OK, so I've started writing my own word-list merge and duplicate remover, and I've already run it on 2 word-lists which are nearly 2 GB each. It uses your local HD instead of memory, but it chunks the data in memory before flushing it to files on the local HD, which improves performance massively. You will need to make sure you have enough local HD space: at least SUM(all word-lists).

Processing files are stored in: [BaseDirectory]\tmp

The finished file is not fully sorted, but it is grouped into 256 chunks, 00 to ff, based on the HEX value of the plain-text.
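
For anyone curious how a disk-backed, chunked de-duplication like this can work, here is a minimal sketch. It is not the App.Merge source. It assumes the 256 chunks are keyed on the first byte of each word (the post only says "HEX value of plain-text"), and the name `bucket_dedupe` and the temp-file layout are made up for illustration.

```python
import os
import tempfile


def bucket_dedupe(inputs, output_path):
    """Merge word-lists and drop duplicates using 256 on-disk buckets.

    Assumption: a word's bucket is its first byte (00 to ff). Each bucket
    must fit in memory on its own; the whole input does not have to.
    """
    removed = 0
    with tempfile.TemporaryDirectory() as tmp:
        # Pass 1: scatter every word into the bucket file for its first byte.
        buckets = [open(os.path.join(tmp, f"{i:02x}.tmp"), "wb") for i in range(256)]
        try:
            for path in inputs:
                with open(path, "rb") as src:
                    for line in src:
                        word = line.rstrip(b"\r\n")
                        if word:
                            buckets[word[0]].write(word + b"\n")
        finally:
            for handle in buckets:
                handle.close()

        # Pass 2: de-duplicate one bucket at a time and append it to the output.
        with open(output_path, "wb") as out:
            for i in range(256):
                seen = set()
                with open(os.path.join(tmp, f"{i:02x}.tmp"), "rb") as bucket:
                    for line in bucket:
                        if line in seen:
                            removed += 1
                        else:
                            seen.add(line)
                            out.write(line)
    return removed
```

The output of this sketch comes out grouped by bucket, 00 to ff, but unsorted inside each bucket, which matches the "256 chunks" description. Disk use is roughly one extra copy of the input, in line with the SUM(all word-lists) requirement.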

Example:

passwords.txt
Original Size: 184 MB
Converted Size: 169 MB
Duplicates Removed: 28
Run-time: 129 seconds

Obviously we can merge multiple files, but I'm just processing 1 file for the above example. It takes a while, but I'm working on improving it. Anyone interested in trying it, please see below:

http://home.btconnect.com/md5decrypter/App.Merge.zip [21.4 KB]

Command format:

App.Merge.exe o="output-file.txt" t=4 [options] ... "word-list1.txt" "word-list2.lst" "directory1" ...

For a report analysis of a word-list:

App.Merge.exe r="word-list1.txt"

Double-quotes required for path / file names which contain spaces. You can also specify directory paths if you wish to merge / sort whole directories.

o=[out-file] - Output file.
t=[threads] - Number of threads. Only used to speed up sorting.
c=[mem] - How much RAM to use, in MB. Default is 1024. Capped at 3072.
min=[num] - Minimum word length. Default is 1.
max=[num] - Maximum word length. Default is 4096.

Words containing control characters will be converted into the Hashcat HEX format: $HEX[...]
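
As an illustration of the $HEX[...] convention, here is a small sketch of the conversion in both directions. It is only a sketch: it assumes "control character" means a byte below 0x20 or the byte 0x7f, which may not be the exact rule the tool applies, and the function names are made up.

```python
def to_hashcat_hex(word: bytes) -> bytes:
    """Return the word unchanged, or as $HEX[...] if it has control bytes.

    Assumption: a control byte is anything below 0x20, or 0x7f.
    """
    if any(b < 0x20 or b == 0x7F for b in word):
        return b"$HEX[" + word.hex().encode("ascii") + b"]"
    return word


def from_hashcat_hex(word: bytes) -> bytes:
    """Decode a $HEX[...] word back to raw bytes; pass other words through."""
    if word.startswith(b"$HEX[") and word.endswith(b"]"):
        return bytes.fromhex(word[5:-1].decode("ascii"))
    return word


print(to_hashcat_hex(b"pass\tword"))  # b'$HEX[7061737309776f7264]'
```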


Please read the forum rules | Please read the paid section rules
I accept private hash lists, with forum donations only.
BTC: 15qF9WUeFUD63ishxyAMiEgGqTcYzk4j9b
GPU Power: 5x GeForce GTX 1070 and My Brain

blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2787
Team: HashKiller
Reputation: 3701 Reputation
Offline
Mon, 31 Mar 2014 @ 23:14:37

More examples:

output.txt
Original Size: 511 MB
Converted Size: 462 MB
Duplicates Removed: 5721
Run-time: 309 seconds (5 mins)

tmto.txt
Original Size: 1.72 GB
Converted Size: 1.55 GB
Duplicates Removed: 4570052
Run-time: 951 seconds (16 mins)

tmto.txt + insidepro.txt
Original Size: 1.55 GB + 1.46 GB (3.01 GB)
Converted Size: 2.63 GB
Duplicates Removed: 37024133
Run-time: 1999 seconds (33 mins)


yhi

Status: n/a
Joined: Tue, 05 Nov 2013
Posts: 875
Team:
Reputation: 365 Reputation
Offline
Tue, 01 Apr 2014 @ 08:33:15

gonna try it soon


blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2787
Team: HashKiller
Reputation: 3701 Reputation
Offline
Tue, 01 Apr 2014 @ 08:56:46

Your word-lists are WAY too small! This tool is for LARGE word-lists. Will post a fix but it works better with large lists.

UPDATE: Fixed. Please download a new copy and try again.


yhi

Status: n/a
Joined: Tue, 05 Nov 2013
Posts: 875
Team:
Reputation: 365 Reputation
Offline
Tue, 01 Apr 2014 @ 09:39:11

blandyuk said:

Your word-lists are WAY too small! This tool is for LARGE word-lists. Will post a fix but it works better with large lists.

UPDATE: Fixed. Please download a new copy and try again.


Yup, I know, they are both 1 KB.
I just created them for testing.

Now it's working fine.

Awesome work, dude.

UPDATE:

While removing duplicates and merging, I found that it's skipping the " + " sign.


blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2787
Team: HashKiller
Reputation: 3701 Reputation
Offline
Tue, 01 Apr 2014 @ 11:45:26

Running a 13.4 GB word-list through it now

Can you explain the " + " issue in a bit more detail please, yhi? Also, thanks for the feedback +rep


yhi

Status: n/a
Joined: Tue, 05 Nov 2013
Posts: 875
Team:
Reputation: 365 Reputation
Offline
Tue, 01 Apr 2014 @ 15:19:13

blandyuk said:

Running a 13.4 GB word-list through it now

Can you explain the " + " issue in a bit more detail please, yhi? Also, thanks for the feedback +rep

Check your PM, and thanks for the +rep.

+10 for such an awesome tool.
It helped me remove 420 MB of duplicates.


blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2787
Team: HashKiller
Reputation: 3701 Reputation
Offline
Tue, 01 Apr 2014 @ 22:15:38

yhi

I've fixed a couple of bugs which will have affected the output so you'll need to run it again on your original word-lists. Your issues from before should be fixed

EDIT: Been thinking about adding more features to this app:

  • Splitting single / multiple word-lists by length. Range [1-64] and [other]. Output files = [InFilename]_len[x].[ext]
  • Additional options for the above so you can, for example, export only words with a length of 8.
I know ULM does this already, but I'll be adding it anyway since my app supports huge word-lists.
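
The length-splitting idea could look roughly like the sketch below. It is only an illustration of the proposal, not the tool's code. The function name `split_by_length`, the `only` parameter, the `_lenother` file name for the [other] group, and counting length in bytes are all assumptions.

```python
import os


def split_by_length(in_path, out_dir, max_len=64, only=None):
    """Write each word to [InFilename]_len[x].[ext] according to its length.

    Words longer than max_len go to a single "other" file. If `only` is
    given, words of any other length are skipped. Length is counted in
    bytes here, which is an assumption.
    """
    base, ext = os.path.splitext(os.path.basename(in_path))
    handles = {}
    try:
        with open(in_path, "rb") as src:
            for line in src:
                word = line.rstrip(b"\r\n")
                if not word:
                    continue
                if only is not None and len(word) != only:
                    continue
                key = str(len(word)) if len(word) <= max_len else "other"
                if key not in handles:
                    name = f"{base}_len{key}{ext}"
                    handles[key] = open(os.path.join(out_dir, name), "wb")
                handles[key].write(word + b"\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Because the input is streamed line by line and at most 65 output files are open at once, a sketch like this would not need the whole word-list in memory.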


yhi

Status: n/a
Joined: Tue, 05 Nov 2013
Posts: 875
Team:
Reputation: 365 Reputation
Offline
Wed, 02 Apr 2014 @ 09:39:37

blandyuk said:

yhi

I've fixed a couple of bugs which will have affected the output so you'll need to run it again on your original word-lists. Your issues from before should be fixed

EDIT: Been thinking about adding more features to this app:

  • Splitting single / multiple word-lists by length. Range [1-64] and [other]. Output files = [InFilename]_len[x].[ext]
  • Additional options for the above so you can, for example, export only words with a length of 8.
I know ULM does this already, but I'll be adding it anyway since my app supports huge word-lists.

Waiting for updated version


blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2787
Team: HashKiller
Reputation: 3701 Reputation
Offline
Wed, 02 Apr 2014 @ 10:01:51

It's already available with the fixes I mention:

http://home.btconnect.com/md5decrypter/App.Merge.zip

Extra features will come in the next release, at which point I'll start using BETA version numbering.


K9

Status: n/a
Joined: Sat, 30 Jul 2011
Posts: 113
Team:
Reputation: 38 Reputation
Offline
Wed, 02 Apr 2014 @ 19:09:18

Filesize: 296 MB

Total time: 22,3375018 seconds
Cygwin sort -u: 53.014 seconds
Cygwin sort: 53.146 seconds
awk '!a[$0]++': 6.860 seconds
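
For readers who have not met the awk one-liner in the timings above: it prints a line only the first time it is seen, tracking every distinct line in an in-memory table. A rough Python equivalent, with a made-up function name, would be:

```python
def dedupe_keep_order(lines):
    """Yield each distinct line once, in first-seen order.

    Same idea as awk '!a[$0]++': every distinct line is held in memory,
    so memory use grows with the number of unique lines.
    """
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


print(list(dedupe_keep_order(["b", "a", "b", "c", "a"])))  # ['b', 'a', 'c']
```

There is no sorting step in this approach, and the original order of the list is preserved.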


blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2787
Team: HashKiller
Reputation: 3701 Reputation
Offline
Wed, 02 Apr 2014 @ 19:18:34

It now does a FULL sort, so please download again and test.

K9 - Get awk to do a 3-4 GB word-list. It looks like it's processing in memory, which it cannot do with huge word-lists.
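
For contrast with the in-memory approach, a full sort of a list larger than RAM is commonly done as an external merge sort: sort fixed-size chunks in memory, flush each to disk, then merge the sorted chunks. The sketch below shows that general technique under stated assumptions. It is not how App.Merge is implemented, and the function name and `chunk_lines` parameter are hypothetical.

```python
import heapq
import itertools
import os
import tempfile


def external_sort_unique(in_path, out_path, chunk_lines=1_000_000):
    """Sort a word-list and drop duplicates without loading it all into memory."""
    chunk_paths = []
    with tempfile.TemporaryDirectory() as tmp:
        # Phase 1: read fixed-size chunks, sort each in memory, flush to disk.
        with open(in_path, "rb") as src:
            while True:
                chunk = [line.rstrip(b"\r\n") + b"\n"
                         for line in itertools.islice(src, chunk_lines)]
                if not chunk:
                    break
                chunk.sort()
                path = os.path.join(tmp, f"chunk{len(chunk_paths)}.tmp")
                with open(path, "wb") as out:
                    out.writelines(chunk)
                chunk_paths.append(path)

        # Phase 2: k-way merge of the sorted chunks, skipping repeated lines.
        files = [open(p, "rb") for p in chunk_paths]
        try:
            previous = None
            with open(out_path, "wb") as out:
                for line in heapq.merge(*files):
                    if line != previous:
                        out.write(line)
                        previous = line
        finally:
            for handle in files:
                handle.close()
```

One caveat with this sketch: every chunk file is open during the merge, so a very large input with a small `chunk_lines` could hit the operating system's open-file limit. Merging in several passes would avoid that.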


K9

Status: n/a
Joined: Sat, 30 Jul 2011
Posts: 113
Team:
Reputation: 38 Reputation
Offline
Wed, 02 Apr 2014 @ 19:51:41

Total time: 43,8421534 seconds


c0ld

Status: n/a
Joined: Tue, 23 Jul 2013
Posts: 122
Team:
Reputation: 70 Reputation
Offline
Sat, 19 Apr 2014 @ 21:05:40

Works perfectly on Windows,
but what about a version of this tool for Linux?


blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2787
Team: HashKiller
Reputation: 3701 Reputation
Offline
Tue, 08 Jul 2014 @ 21:38:56

Revisiting this as I've added it to my downloads page now. I've updated my original post with updated features so please give it a try and see what you think


giveen

Status: n/a
Joined: Fri, 12 Jul 2013
Posts: 697
Team: Newbie Teaching Squad
Reputation: 385 Reputation
Offline
Tue, 08 Jul 2014 @ 22:04:47

About to try this on 9.5GB of my biggest word lists.


Right: 4x R9-270(x)

c0ld

Status: n/a
Joined: Tue, 23 Jul 2013
Posts: 122
Team:
Reputation: 70 Reputation
Offline
Tue, 08 Jul 2014 @ 22:05:02

@blandy we need a Linux one


giveen

Status: n/a
Joined: Fri, 12 Jul 2013
Posts: 697
Team: Newbie Teaching Squad
Reputation: 385 Reputation
Offline
Wed, 09 Jul 2014 @ 05:14:22


Merge complete to: big_list.txt
Words skipped: 0
Duplicates removed: 135225094
Total time: 4773.2977014 seconds


blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2787
Team: HashKiller
Reputation: 3701 Reputation
Offline
Wed, 09 Jul 2014 @ 08:22:24

giveen, it certainly removed a lot of duplicates for you


cyberoot1

Status: n/a
Joined: Tue, 08 Jul 2014
Posts: 3
Team:
Reputation: 0 Reputation
Offline
Wed, 09 Jul 2014 @ 11:08:22

It really helped me a lot, bro.

It cut down the time.

Very useful tool.


giveen

Status: n/a
Joined: Fri, 12 Jul 2013
Posts: 697
Team: Newbie Teaching Squad
Reputation: 385 Reputation
Offline
Wed, 09 Jul 2014 @ 12:10:09

blandyuk said:

giveen, it certainly removed a lot of duplicates for you

Yeah, it worried me a bit that it got rid of passwords that are not duplicates.


giveen

Status: n/a
Joined: Fri, 12 Jul 2013
Posts: 697
Team: Newbie Teaching Squad
Reputation: 385 Reputation
Offline
Wed, 09 Jul 2014 @ 13:24:21


Merge complete to: The_New_New_Best_Dict.txt
Words skipped: 2
Duplicates removed: 44668857
Total time: 226.8911617 seconds


musa

Status: n/a
Joined: Thu, 19 Jun 2014
Posts: 11
Team:
Reputation: 0 Reputation
Offline
Wed, 09 Jul 2014 @ 16:19:45

hello dear error


blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2787
Team: HashKiller
Reputation: 3701 Reputation
Offline
Wed, 09 Jul 2014 @ 17:19:04

musa

Read the first post of this topic; it explains the command-line parameters you need to use. You cannot just run the app. Also, you'll need .NET v4.0 installed for it to work.


cvsi
Moderator
Status: Trusted
Joined: Fri, 23 May 2014
Posts: 2054
Team: CynoSure Prime
Reputation: 2872 Reputation
Offline
Thu, 10 Jul 2014 @ 01:12:02

I ran this on a dictionary and it reported that there were no duplicates, yet the output file is 50 MB smaller than the original. So did it in fact remove words from the list?
It's actually the same list that giveen ran earlier.


Please read the forum rules. | Please read the paid section rules.

280x, 390x
GTX 1080 Ti , GTX 1080 , GTX 1070 Everything watercooled

BTC - 1As13jsySvbN5wjcNJP3AASiazDX9pVdVw

VTSTech

Status: n/a
Joined: Fri, 18 Jul 2014
Posts: 205
Team: VTSTech
Reputation: 168 Reputation
Offline
Sat, 19 Jul 2014 @ 04:20:46

I'll just leave this here

C:\Tools\oclHashcat-1.21>copy rockyou.txt +english.txt rockenglish.txt
rockyou.txt
english.txt
1 file(s) copied.


VTS-Tech.org Veritas Technical Solutions | XMPP VTSTech@jabber.ccc.de/veritas@creep.im BTC 1VTSgzD24bjkSGdD7kvauxkxHZ4yiwhdU

blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2787
Team: HashKiller
Reputation: 3701 Reputation
Offline
Sat, 19 Jul 2014 @ 10:14:40

cvsi

If your list uses \r\n as the line separator, I change it to just \n so this in turn reduces the filesize.

VTSTech

Doing that only merges files, it does not sort OR remove duplicates.
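
To put a number on the line-separator point: converting \r\n to \n saves exactly one byte per affected line, so a drop of about 50 MB with no duplicates removed would correspond to roughly 50 million CRLF-terminated lines. A small sketch to check this on a given file (the function name is made up):

```python
def crlf_savings(path):
    """Return how many bytes converting CRLF line endings to LF would save."""
    saved = 0
    with open(path, "rb") as src:
        for line in src:
            if line.endswith(b"\r\n"):
                saved += 1  # one carriage return dropped per line
    return saved
```

If the result is close to the size difference between the original and the output, no words were lost.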


horriblecoders

Status: n/a
Joined: Sun, 04 Aug 2013
Posts: 8
Team:
Reputation: 3 Reputation
Offline
Sat, 26 Jul 2014 @ 20:49:00

c0ld said:

@blandy we need a Linux one

Just use the command:

cat wordlist1.txt wordlist2.txt wordlist3.txt | sort | uniq > output.txt
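
A quick illustration of why the choice of uniq flag matters when de-duplicating: plain `uniq` (or `sort -u`) keeps one copy of every distinct line, while `uniq -u` keeps only the lines that were never repeated, so every duplicated word disappears entirely. The Python below mimics both behaviours on sorted input. The helper names are made up.

```python
from itertools import groupby


def uniq(sorted_lines):
    """Like `uniq`: one copy of every distinct line."""
    return [key for key, _ in groupby(sorted_lines)]


def uniq_u(sorted_lines):
    """Like `uniq -u`: only lines that occur exactly once."""
    return [key for key, group in groupby(sorted_lines) if len(list(group)) == 1]


words = sorted(["password", "123456", "password", "letmein"])
print(uniq(words))    # ['123456', 'letmein', 'password']
print(uniq_u(words))  # ['123456', 'letmein']
```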


c0ld

Status: n/a
Joined: Tue, 23 Jul 2013
Posts: 122
Team:
Reputation: 70 Reputation
Offline
Sat, 26 Jul 2014 @ 21:19:57

@horriblecoders it won't run at the same speed or with the same resource usage


blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2787
Team: HashKiller
Reputation: 3701 Reputation
Offline
Sun, 27 Jul 2014 @ 00:45:10

horriblecoders, try it with a 14 GB word-list and a 5 GB word-list

That's the difference.

