8,000,000 lines, need to remove duplicates

whelan189
Posted 9 months, 6 days ago
Got a file that is 8 million or so lines and I need to remove the duplicates, but every method I've tried crashes. Not really fancying separating it into 1,000 files to do it. Anyone know how? Thanks

All Responses (16)

#1
What format is the file? Reading 8m lines is intensive. You'll likely need something like a proper database and a query.
#2
darlodge
What format is the file? Reading 8m lines is intensive. You'll likely need something like a proper database and a query.

Just a Notepad text file. Looks like I will have to separate it into several files to be able to use tools.
#3
Can you not import it into Excel?
#4
You can use a free ETL tool such as 'CloverETL'. Easy ramp up and small learning curve. Worthwhile skill to have.

http://www1.cloveretl.com/community-edition
#5
Notepad is horrendous with large files; it can't handle the RAM correctly. Excel can't read more than about 125k lines (from memory).

What format is the file? Is it a CSV-delimited file? 8 million lines is a serious amount of data to query and process.

Something like MySQL would stand a far better chance, but you'd need the equivalent SQL query rather than a find and replace.
#6
Try dupeguru.
#7
Actually, a database load may be the quickest solution. If it is already comma separated and you know the structure (read it at the command prompt using 'more filename.txt' for a preview of the data), then you can create a table and import the text file. Then just do a DISTINCT on the table and export the result as txt. I'm more than happy to help if you want to forward the file.
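
For example, here is a minimal sketch of that workflow using SQLite through Python's standard library as a stand-in for a full database server (the filenames and the single-column table are assumptions):

import sqlite3

# SQLite as a lightweight stand-in for the database approach described above
con = sqlite3.connect("dedupe.db")
con.execute("CREATE TABLE lines (line TEXT)")

# Stream the file in so all 8m lines never sit in memory at once
with open("filename.txt", encoding="utf-8") as f:
    con.executemany("INSERT INTO lines VALUES (?)",
                    ((line.rstrip("\n"),) for line in f))
con.commit()

# DISTINCT does the actual deduplication; write the result back out as text
with open("deduped.txt", "w", encoding="utf-8") as out:
    for (line,) in con.execute("SELECT DISTINCT line FROM lines"):
        out.write(line + "\n")

SELECT DISTINCT is the query doing the work here; the same statement would run unchanged on MySQL.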
#8
I'd suggest Kudaz text editor. I can't promise you it'll handle 8 million lines, but I've used it on some pretty large files and it has a remove duplicate lines function:
http://www.portablefreeware.com/?id=1252
#9
Yes, I agree: get it into a CSV or XLS file and then use a VLOOKUP formula to remove the duplicates, or sort A-Z so the duplicates are easily seen, but for 8m of them it might be a slow process.
#10
weenat2008
Yes, I agree: get it into a CSV or XLS file and then use a VLOOKUP formula to remove the duplicates, or sort A-Z so the duplicates are easily seen, but for 8m of them it might be a slow process.

Excel can only handle 1 million rows. You could, however, split the file into 1m-row chunks if you wanted to go down this route, as sketched below.
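
If you did go down that route, here is a rough Python sketch of just the splitting step (input.txt and the chunk names are placeholders). Note that deduplicating each chunk separately still won't catch duplicates that land in different chunks.

# Split input.txt into chunks of 1,000,000 lines each
CHUNK = 1_000_000
out = None
with open("input.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i % CHUNK == 0:          # start a new chunk file every million lines
            if out:
                out.close()
            out = open(f"chunk_{i // CHUNK:03d}.txt", "w", encoding="utf-8")
        out.write(line)
if out:
    out.close()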
#11
No idea if this would crash it, but you can use the following PowerShell command:

gc input.txt | select -Unique > output.txt

Where input.txt is your current text file. This removes duplicate lines and writes them to a new text file called output.txt.
#12
Should be easy to make a quick script to remove duplicates using something like Python or PowerShell.
You can do it straight from the command prompt if you're running Linux.
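
For instance, a minimal Python sketch (the file names are placeholders; it keeps the first occurrence of each line and needs enough RAM to hold the set of unique lines):

seen = set()
with open("input.txt", encoding="utf-8") as f, \
     open("output.txt", "w", encoding="utf-8") as out:
    for line in f:
        if line not in seen:   # only write lines we haven't seen before
            seen.add(line)
            out.write(line)

Unlike sorting, this also preserves the original line order.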
#13
What type of data does this TXT file contain, and what is its purpose?
#14
Linux is your friend here. So either get hold of somebody who has a laptop/desktop running Ubuntu or any other Linux distribution, or alternatively install a distribution alongside your current Windows (take care if you are not confident doing this).

Transfer the file from your system to the Linux machine, open up a terminal and issue the following command:

sort file.txt | uniq > newfile.txt

It should take a couple of minutes at most, depending on the power of the machine.
Note this is removing duplicate lines in a text file, not duplicate records!

Removing duplicate records is a little more complex, but could be done depending upon the file/record structure.
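
As a rough illustration of the record case, assuming a comma-separated file whose first field is the record key (the file names and the key position are guesses; adjust them to the real structure):

import csv

seen = set()
with open("file.txt", newline="", encoding="utf-8") as f, \
     open("newfile.txt", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for row in csv.reader(f):
        key = row[0]            # assumed key field
        if key not in seen:     # keep only the first record per key
            seen.add(key)
            writer.writerow(row)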
#15
Try Notepad++.
#16
I'd import the data into a MySQL database; it could just be each line in a cell (depending on content).

Then I would select the unique records into a new table.

Then export the new table to a text file.

Job done.
