Wednesday, August 29, 2012

Combine Matching Datasets

I have multiple text documents that contain pipe delimited data (|).  It would be much easier to work with if the information could be appended or otherwise combined into a single data file.

My data are currently stored as 16 individual .txt files, however it would be nice to know how to convert comma seperated data (.csv format),or any other simple text files as well.  We can easily combine these using a simple, one line DOS/command prompt command.

16 data files are a nightmare.

Merge documents from a DOS/command prompt:
  • Click Start :: Run, type cmd and hit enter to bring up a command prompt
  • Use CD (Change Directory) commands to navigate to the desired working directory of your individual datasets
  • Use the copy command to create a new file that contains a copy of all of the datasets together.  See below for syntax and examples
Syntax:

copy *.[format] [output_filename].[format]


In English this instruction says, "copy the contents of all files in this directory of a certain file format to a new file named whatever."

Example:

copy *.txt combined.txt



Command prompt view:
View of running the copy command

Output folder view:
A single, appended file is wonderful, but it'll need some minimal formatting

Remove headers stored as records:
The data are likely to have originally been exported with field names/headings in the first row.

Three individual text files with a header in the first
row that provides field names for each dataset

These will be included in the output file because the copy function appends every line of the originating datasets into a the new file.

The copy command appended the field headings
into the dataset.  These will need to be removed.
Removing these can be as simple as importing the file into spreadsheet processing software, such as Excel, and manually looking through the dataset for these lines and deleting any of those rows.  Of course we will leave the very first row of these alone since we will likely want one dataset header at the beginning of the file.  The remaining instance will otherwise be stored as actual records and that is incorrect.

Another way is to import the dataset into a database such as MS Access and run a SQL query to remove the erroneous records:

SQL syntax:

DELETE FROM [table name]
WHERE [field 1 name] = "[
field 1 name]" AND
      [field 2 name] = "[field 2 name]";


This will begin by attempting to delete rows from the table that was imported.  The query is limited to records where the data in a field matches the heading.  An AND statement is used to make sure it was not a coincidence that the data matched the field heading.  This can be repeated for each field in the dataset if necessary.  Here's an example from my data.

Example statement:

DELETE FROM GWISUpdates
WHERE station_id = "station_id" AND

      resource_type = "resource_type";