HUGE word-lists duplicate remover and merge tool


horriblecoders

Status: n/a
Joined: Sun, 04 Aug 2013
Posts: 8
Team:
Reputation: 3 Reputation
Offline
Sun, 27 Jul 2014 @ 07:37:43

blandyuk said:

horriblecoder, try it with a 14GB word-list and a 5GB word-list

That's the difference.

I have not tried the tool you posted because I don't have a Windows box right now, so I didn't know whether it was faster than the code I posted. Just wanted to put that out there in case someone needed to merge lists on Linux.


blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2916
Team: HashKiller
Reputation: 3911 Reputation
Offline
Sun, 27 Jul 2014 @ 10:29:06

I am looking at Mono to port the C#.NET code over to Linux, so will hopefully get it working. I've also tried merging in Linux with the standard tools and it's slower.


Please read the forum rules | Please read the paid section rules
I accept private hash lists, with forum donations only.
BTC: 15qF9WUeFUD63ishxyAMiEgGqTcYzk4j9b
GPU Power: 7x GeForce GTX 1070 and My Brain

ewwink

Status: n/a
Joined: Thu, 07 Aug 2014
Posts: 31
Team:
Reputation: 0 Reputation
Offline
Thu, 07 Aug 2014 @ 18:29:53

For easy management you should merge it with hashcatGui, and thanks for the tools.


X-Attack!

Status: n/a
Joined: Wed, 03 Oct 2012
Posts: 20
Team: X-Attack!
Reputation: 13 Reputation
Offline
Fri, 15 Aug 2014 @ 00:50:31

Hmm, it doesn't work for me... I want to merge all *.txt files from the folder "Wordlists" into the file "Wordlist.txt", but it gives me a strange output:

Code:
M:\>App.Merge.exe o="Wordlist.txt" t=4 "Wordlists"
Merge Tool by BlandyUK - v0.2

Input file / dir does not exist: o=Wordlist.txt
Input files / dirs: 1
Combined filesize: 40527486856
Output file: Wordlist.txt
Total time: 0,0033124 seconds


I also accept private hash lists

x-attack.net is for sale now!

blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2916
Team: HashKiller
Reputation: 3911 Reputation
Offline
Fri, 15 Aug 2014 @ 08:45:53

Can you try with individual word-lists just to check it is working for you? Also, use absolute paths.


30k

Status: n/a
Joined: Tue, 12 Aug 2014
Posts: 18
Team:
Reputation: 10 Reputation
Offline
Fri, 15 Aug 2014 @ 09:42:40

I'm getting the same as X-Attack.

Code:
Z:\>App.Merge.exe o="All_Combined_Sorted.dic" t=4 "All_Combined.dic"
Merge Tool by BlandyUK - v0.2
Input file / dir does not exist: o=All_Combined_Sorted.dic 
Input files / dirs: 1 
Combined filesize: 38509985225 
Output file: All_Combined_Sorted.dic 
Total time: 0,0033698 seconds

Same result when using full paths:

Code:
Z:\>App.Merge.exe o="Z:\All_Combined_Sorted.dic" t=4 "Z:\All_Combined.dic"
Merge Tool by BlandyUK - v0.2
Input file / dir does not exist: o=All_Combined_Sorted.dic
Input files / dirs: 1
Combined filesize: 38509985225 
Output file: All_Combined_Sorted.dic 
Total time: 0,0021062 seconds



________________________________________
BTC: 1HUMD5LkAgfZh5PfWwZPeJ1Z4ERuX5ogfh

blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2916
Team: HashKiller
Reputation: 3911 Reputation
Offline
Fri, 15 Aug 2014 @ 10:05:52

Ah damnit! It was bugged. I have fixed it, so download and try again:

http://www.hashkiller.co.uk/downloads.aspx


30k

Status: n/a
Joined: Tue, 12 Aug 2014
Posts: 18
Team:
Reputation: 10 Reputation
Offline
Fri, 15 Aug 2014 @ 10:17:17

Alright, just ran it on a 12MB file; it works and finds all dupes.

Will run it on a 38GB file now

Cheers.



blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2916
Team: HashKiller
Reputation: 3911 Reputation
Offline
Fri, 15 Aug 2014 @ 10:38:45

OK awesome. Post the results of the 38GB file once done. Will take a while though lol

The report feature is very useful although it only reports on word length atm. I'll see about including char-sets used soon.
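For anyone without the tool to hand, a word-length report of the kind described here can be roughed out in a few lines of Python. This is only a sketch and not how App.Merge does it; the name `length_report` is made up for illustration, and lengths are measured in bytes.

```python
from collections import Counter


def length_report(words):
    """Count how many words there are of each length (in bytes)."""
    counts = Counter()
    for word in words:
        # Strip the line terminator (LF or CRLF) before measuring.
        counts[len(word.rstrip(b"\r\n"))] += 1
    # Return the counts ordered by word length.
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    sample = [b"password\n", b"123456\n", b"letmein\r\n", b"qwerty\n"]
    for length, count in length_report(sample).items():
        print(f"length {length}: {count}")
```

To run it over a real list, pass a file opened in binary mode (`open(path, "rb")`), which yields one line at a time.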


30k

Status: n/a
Joined: Tue, 12 Aug 2014
Posts: 18
Team:
Reputation: 10 Reputation
Offline
Fri, 15 Aug 2014 @ 13:51:55

Progress update.

It just went through 100% of the file & is counting duplicates now.

It had skipped 261 lines. Which lines are skipped?

Will edit post when it's finished


EDIT: literally 3 mins after making this post the proggy just closed itself :/

Not sure if it crashed due to a software error or something else.



blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2916
Team: HashKiller
Reputation: 3911 Reputation
Offline
Fri, 15 Aug 2014 @ 15:07:35

Skipped lines are length-based. Default min and max lengths are specified in the first post.
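As a rough illustration of length-based skipping (a sketch only, not the tool's code): `filter_by_length` is a hypothetical helper, and the limits in the example are placeholders rather than the real defaults.

```python
def filter_by_length(words, min_len, max_len):
    """Return (kept, skipped): words within [min_len, max_len] and a count of the rest."""
    kept = []
    skipped = 0
    for word in words:
        if min_len <= len(word) <= max_len:
            kept.append(word)
        else:
            skipped += 1
    return kept, skipped


# Hypothetical limits, NOT the tool's real defaults.
kept, skipped = filter_by_length(["a", "hunter2", "x" * 200], min_len=4, max_len=64)
print(kept, skipped)
```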

OK, did you run it in a .bat or .cmd file? If so, you need to put a "pause" after your command so you can catch any errors. You should be able to see the error in the Windows Event Viewer under Windows Logs -> Application.


30k

Status: n/a
Joined: Tue, 12 Aug 2014
Posts: 18
Team:
Reputation: 10 Reputation
Offline
Fri, 15 Aug 2014 @ 15:30:54

I was running just from command prompt.

Now I made a bat file and started it again.

Seems like it did crash according to the Event viewer.

Not sure if this will help you, but here you go

said:

Application: App.Merge.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.IO.IOException
Stack:
at System.IO.__Error.WinIOError(Int32, System.String)
at System.IO.FileStream.Init(System.String, System.IO.FileMode, System.IO.FileAccess, Int32, Boolean, System.IO.FileShare, Int32, System.IO.FileOptions, SECURITY_ATTRIBUTES, System.String, Boolean, Boolean, Boolean)
at System.IO.FileStream..ctor(System.String, System.IO.FileMode, System.IO.FileAccess, System.IO.FileShare, Int32, System.IO.FileOptions, System.String, Boolean)
at System.IO.FileStream..ctor(System.String, System.IO.FileMode, System.IO.FileAccess, System.IO.FileShare)
at App.Merge.Wordlist.Open(System.String)
at App.Merge.Program.readWordlist(System.String)
at App.Merge.Program.Main(System.String[])

said:

Faulting application name: App.Merge.exe, version: 1.0.0.0, time stamp: 0x53edcd47
Faulting module name: KERNELBASE.dll, version: 6.1.7601.18409, time stamp: 0x5315a05a
Exception code: 0xe0434352
Fault offset: 0x000000000000940d
Faulting process id: 0x2da4
Faulting application start time: 0x01cfb888d52f16f8
Faulting application path: Z:\App.Merge.exe
Faulting module path: C:\Windows\system32\KERNELBASE.dll
Report Id: 139feec1-247c-11e4-977d-001cc0b21279




blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2916
Team: HashKiller
Reputation: 3911 Reputation
Offline
Fri, 15 Aug 2014 @ 15:54:56

You do have enough space to process this list? It needs at least the size of the original list in free disk space. It's a file I/O error, so it looks space-related OR another process locked the files as they were being processed. It could also be a security issue that caused it.
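A quick pre-flight check along those lines, as a sketch only: it compares the input size with the free space on the output volume. `has_room_for` is a made-up helper, and the "at least the input size" rule is simply the rule of thumb from this post; the tool's real temp-file needs may differ.

```python
import os
import shutil


def has_room_for(input_path, output_dir):
    """True if output_dir's volume has at least as many free bytes as input_path occupies."""
    needed = os.path.getsize(input_path)
    free = shutil.disk_usage(output_dir).free
    return free >= needed
```

Calling `has_room_for(r"Z:\All_Combined.dic", "Z:\\")` before starting a long run would catch the out-of-space case early, though not locked files or permission problems.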

Trying a 2.83GB word-list myself now just in case...

UPDATE: It ran OK with no problems.


30k

Status: n/a
Joined: Tue, 12 Aug 2014
Posts: 18
Team:
Reputation: 10 Reputation
Offline
Fri, 15 Aug 2014 @ 23:05:48

ALLRIGHT.

Came home & it was done

Pretty wicked.

Words Skipped: 261
Duplicates removed: 888 709 951 <: Holy sh*t?
Total time: 28477,0596341 seconds or (7 hours, 54 minutes and 37 seconds)

File went from 38GB to 28 GB

Now I'm going to count how many lines there are, but I estimate it's going to be around 5 billion?
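Counting lines in a file this size is best done in binary chunks rather than by loading it into memory. A minimal Python sketch (the name `count_lines` and the 1 MiB chunk size are just illustrative choices):

```python
import io


def count_lines(stream, chunk_size=1 << 20):
    """Count LF bytes in a binary stream, reading fixed-size chunks."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += chunk.count(b"\n")
    return total


# Small in-memory demo; for a real list use open(path, "rb") instead.
demo = io.BytesIO(b"alpha\nbeta\ngamma\n")
print(count_lines(demo))
```

Note this counts LF bytes, so a final line without a trailing newline is not counted.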



blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2916
Team: HashKiller
Reputation: 3911 Reputation
Offline
Fri, 15 Aug 2014 @ 23:40:22

A 38GB word-list is big! The 2.83GB one I did was small in comparison, so not bad. That is a shit load of dups removed, but with word-lists that big, what tools do we have to deal with them? One has to deal with the word-lists both in memory and through disk I/O, and try to optimize both to gain maximum performance.

I've recently written my own StringBuilder() function which uses less memory [RAM] than the built-in one that .NET uses. I'll do some testing and if all is OK (which it has been so far), I'll incorporate it in the next version.

I did recently reduce a word-list from 24GB to 14GB because it contained the whole key-space up to 5 chars, which is pointless. U might as well brute-force. Run your 38GB list through my report feature:

App.Merge r=[drive:\dir\file]

interesting results I imagine

(95 ^ 4) = 81,450,625 [mixalpha-num-sym] ~ data-space = 325,802,500
(62 ^ 5) = 916,132,832 [mixalpha-num] ~ data-space = 4,580,664,160
(36 ^ 6) = 2,176,782,336 [lowalpha-num] ~ data-space = 13,060,694,016

And the data-space does NOT include the LF separator, so add the key-space (one byte per word) on top of each of the above.
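The figures above are easy to reproduce. A small Python sketch, assuming single-byte characters and one LF byte per word; the function names `key_space` and `data_space` are only for illustration:

```python
def key_space(charset_size, length):
    """Number of distinct words of exactly `length` chars over the charset."""
    return charset_size ** length


def data_space(charset_size, length, with_lf=False):
    """Bytes needed to store every such word, optionally with one LF byte per word."""
    per_word = length + (1 if with_lf else 0)
    return key_space(charset_size, length) * per_word


for name, size, length in [("mixalpha-num-sym", 95, 4),
                           ("mixalpha-num", 62, 5),
                           ("lowalpha-num", 36, 6)]:
    print(name,
          key_space(size, length),
          data_space(size, length),
          data_space(size, length, with_lf=True))
```

With `with_lf=True` the result is the plain data-space plus the key-space, which is the adjustment described above.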


30k

Status: n/a
Joined: Tue, 12 Aug 2014
Posts: 18
Team:
Reputation: 10 Reputation
Offline
Sat, 16 Aug 2014 @ 08:36:22

Ran the report feature.

It went to 126 Length but this is the more important stuff.

Kind of feels useless to have anything above length 50.
I might re-sort & add a max length.

EDIT: There are 2 489 325 709 lines in the sorted file.
Not bad i'd say



c0ld

Status: n/a
Joined: Tue, 23 Jul 2013
Posts: 122
Team:
Reputation: 70 Reputation
Offline
Sat, 16 Aug 2014 @ 16:44:36

@BlandyUk tested on my Ubuntu box using "wine", and it works very well
many thanks


blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2916
Team: HashKiller
Reputation: 3911 Reputation
Offline
Sat, 16 Aug 2014 @ 17:45:10

Ah nice c0ld


30k

Status: n/a
Joined: Tue, 12 Aug 2014
Posts: 18
Team:
Reputation: 10 Reputation
Offline
Mon, 18 Aug 2014 @ 18:34:42

40GB file. Combined a shit ton of files.

Uuh, Negative Duplicates?

Is this telling me there were more than 2.147 billion dupes (int) in my file? =D



blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2916
Team: HashKiller
Reputation: 3911 Reputation
Offline
Mon, 18 Aug 2014 @ 20:01:28

Haha that's a LOT of dups. I could change to a uint but tbh, I'll change to a ulong; don't think you'll overflow that. I'll get it changed and re-compiled...
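For anyone wondering how a counter ends up negative: a signed 32-bit int tops out at 2,147,483,647 and, with unchecked arithmetic (the usual C# default), wraps around to negative values. A small Python simulation of that wraparound; `as_int32` is an illustrative helper, not part of the tool:

```python
def as_int32(n):
    """Interpret n modulo 2**32 as a signed 32-bit two's-complement value."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


INT32_MAX = 2**31 - 1    # 2,147,483,647
UINT32_MAX = 2**32 - 1   # 4,294,967,295
UINT64_MAX = 2**64 - 1   # 18,446,744,073,709,551,615

print(as_int32(INT32_MAX))      # 2147483647
print(as_int32(INT32_MAX + 1))  # -2147483648
```

A uint would only double the headroom, whereas a 64-bit counter is far beyond any realistic word count.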

UPDATE: I've re-compiled and added some more report features. Plz download it again:

http://home.btconnect.com/md5decrypter/App.Merge.zip


30k

Status: n/a
Joined: Tue, 12 Aug 2014
Posts: 18
Team:
Reputation: 10 Reputation
Offline
Mon, 18 Aug 2014 @ 21:42:07

Thanx, I have nowhere near enough lists to overflow that long

Any way I could perhaps have the source?

Would like to embed it in my Text Tool



giveen

Status: n/a
Joined: Fri, 12 Jul 2013
Posts: 705
Team: Newbie Teaching Squad
Reputation: 385 Reputation
Offline
Thu, 16 Oct 2014 @ 17:17:29

Any way to speed up processing?

I did t=7 and it's only using 13% of my CPU for the initial processing, and combining 16GB of wordlists takes a while.


Right: 2x GTX 1050 TI

blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2916
Team: HashKiller
Reputation: 3911 Reputation
Offline
Thu, 16 Oct 2014 @ 17:41:16

giveen, it only uses multiple threads when sorting. Initial processing is single thread only


giveen

Status: n/a
Joined: Fri, 12 Jul 2013
Posts: 705
Team: Newbie Teaching Squad
Reputation: 385 Reputation
Offline
Thu, 16 Oct 2014 @ 17:43:30

blandyuk said:

giveen, it only uses multiple threads when sorting. Initial processing is single thread only

Any particular reason?


div0x

Status: n/a
Joined: Wed, 17 Apr 2013
Posts: 21
Team:
Reputation: 12 Reputation
Offline
Thu, 16 Oct 2014 @ 18:21:19

@Blandyuk: can you explain your tool's algorithm? I mean in full detail.
Can you share the source?


giveen

Status: n/a
Joined: Fri, 12 Jul 2013
Posts: 705
Team: Newbie Teaching Squad
Reputation: 385 Reputation
Offline
Thu, 16 Oct 2014 @ 18:29:49


Merge complete to: biggest_list.txt
Duplicates removed: 395029696
Total time: 5924.8924615 seconds


Sorry, it was 12GB of lists, reduced to 10GB.


blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2916
Team: HashKiller
Reputation: 3911 Reputation
Offline
Thu, 16 Oct 2014 @ 21:42:49

giveen
Multi-threading an application is not always simple; it depends on the work each thread has to do. Each piece of work has to be allocated to one particular thread only. That is easy when sorting individual files, which is where my threading comes into play. When reading a word-list, I have to store each word in an array which is accessible throughout the class. Because of this, threading is not possible there, as the class-wide variable causes problems when several threads use it.

Reading a word-list to check against a particular hash or hashes can be threaded with no problem, as each thread can run independently.
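To picture the "sort pieces independently, then combine" idea, here is a small Python sketch. It is NOT the tool's source (which has not been released) and is only an assumption about the general shape: each chunk is sorted by its own worker with no shared mutable state, then the sorted chunks are merged and duplicates dropped. All function names are made up, and in CPython the threads show the structure rather than give a real speed-up because of the GIL.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def sort_chunks(chunks, threads=4):
    """Sort each chunk independently; workers share no mutable state."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(sorted, chunks))


def merge_unique(sorted_chunks):
    """K-way merge of already-sorted chunks, keeping one copy of each word."""
    merged = heapq.merge(*sorted_chunks)
    # Equal words are adjacent after the merge, so groupby collapses them.
    return [word for word, _ in groupby(merged)]


def count_duplicates(chunks, unique):
    """Duplicates removed = total words in minus unique words out."""
    return sum(len(c) for c in chunks) - len(unique)


chunks = [["pass", "admin", "123456"], ["qwerty", "admin"], ["123456", "pass"]]
unique = merge_unique(sort_chunks(chunks))
print(unique, count_duplicates(chunks, unique))
```

The single shared array is what makes the read phase awkward to thread; giving each worker its own chunk, as here, sidesteps that.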

div0x
I might release the source sooner or later but not at the moment.


giveen

Status: n/a
Joined: Fri, 12 Jul 2013
Posts: 705
Team: Newbie Teaching Squad
Reputation: 385 Reputation
Offline
Thu, 16 Oct 2014 @ 22:03:56

Thanks for the clear-up


blandyuk
Admin / Owner
Status: Trusted
Joined: Tue, 05 Jul 2011
Posts: 2916
Team: HashKiller
Reputation: 3911 Reputation
Offline
Thu, 16 Oct 2014 @ 22:08:36

Saying that though... I might have just thought of a way around this, so I'll look at adding threading to the initial processing


h0wler

Status: Elite
Joined: Tue, 08 Nov 2011
Posts: 309
Team:
Reputation: 257 Reputation
Offline
Fri, 17 Oct 2014 @ 01:12:48

Using your tool today I went from a 20 gig word list to an 11 gig list.

Should save some time. :-)



BTC: 16c3rG8EwyNXHDKtCWPtediC3NrVUQhu7M

