Any SQL-types out there? (more lab automation stuff)
Good morning! In the continuing saga of automating various activities where I work, I have taken on processing data files. (Automating data upload is still somewhere in the air.)... This is rather simple in concept, but a bit less obvious in practice (aren't they all!): I would like to open up a CSV file and rearrange the columns according to a standardized order. Not every data file will have all of the standard columns (different probes collect different data), but column headers are standardized. So, I need to inspect each column header so that I can place it and its entire column of data into the proper position in the table, with empty columns inserted where needed to preserve the standard sequence of columns. In other words: A, D, C, G, H becomes A, [B], C, D, [E], [F], G, H where [ ] denote inserted empty columns There are literally hundreds of files of data in CSV format, and I intend to process each by itself, preserving the original, and creating a new "adjusted" file for the next step. (The next step would be to concatenate all files together for numeric processing.) Original files must be preserved (chain of custody requirements), so I am opening the file, selecting records using SQL, and am now ready to do the rearranging/writing to a new file... I didn't know if there is a "well known" SQL way of doing this (e.g., data manipulation statements of some sort), so I thought I'd see if any WLUGgers are also SQL-types. I have a few ideas for how to get this done, but they are "programmer-centric" algorithms, which tend to be different from "SQL-centric" algorithms... The reason for SQL is that I have to do this on a MS platform because that is the standard. But, if most of the real work is done in SQL, then I can easily move to another platform in the future and only have to change the "shell code". I am using VBscript as the "shell language" because it works with all MS apps and file system objects inclusive of CSV files. Thanks for any help!!! Best regards, Steve
I think you'd be far better off using a scripting language such as perl, it's far easier to manipulate text files with perl. Michael Z Stephen Daukas wrote:
Good morning!
In the continuing saga of automating various activities where I work, I have taken on processing data files. (Automating data upload is still somewhere in the air.)...
This is rather simple in concept, but a bit less obvious in practice (aren't they all!):
I would like to open up a CSV file and rearrange the columns according to a standardized order. Not every data file will have all of the standard columns (different probes collect different data), but column headers are standardized. So, I need to inspect each column header so that I can place it and its entire column of data into the proper position in the table, with empty columns inserted where needed to preserve the standard sequence of columns. In other words:
A, D, C, G, H becomes A, [B], C, D, [E], [F], G, H where [ ] denote inserted empty columns
There are literally hundreds of files of data in CSV format, and I intend to process each by itself, preserving the original, and creating a new "adjusted" file for the next step. (The next step would be to concatenate all files together for numeric processing.) Original files must be preserved (chain of custody requirements), so I am opening the file, selecting records using SQL, and am now ready to do the rearranging/writing to a new file...
I didn't know if there is a "well known" SQL way of doing this (e.g., data manipulation statements of some sort), so I thought I'd see if any WLUGgers are also SQL-types. I have a few ideas for how to get this done, but they are "programmer-centric" algorithms, which tend to be different from "SQL-centric" algorithms...
The reason for SQL is that I have to do this on a MS platform because that is the standard. But, if most of the real work is done in SQL, then I can easily move to another platform in the future and only have to change the "shell code". I am using VBscript as the "shell language" because it works with all MS apps and file system objects inclusive of CSV files.
Thanks for any help!!!
Best regards, Steve ------------------------------------------------------------------------
_______________________________________________ Wlug mailing list Wlug@mail.wlug.org http://mail.wlug.org/mailman/listinfo/wlug
There are a lot of things I'd rather use, but I have to stick with the IT standards... On 7/25/07, Michael Zarozinski <MichaelZ@aiguru.com> wrote:
I think you'd be far better off using a scripting language such as perl, it's far easier to manipulate text files with perl.
Michael Z
Stephen Daukas wrote:
Good morning!
In the continuing saga of automating various activities where I work, I have taken on processing data files. (Automating data upload is still somewhere in the air.)...
This is rather simple in concept, but a bit less obvious in practice (aren't they all!):
I would like to open up a CSV file and rearrange the columns according to a standardized order. Not every data file will have all of the standard columns (different probes collect different data), but column headers are standardized. So, I need to inspect each column header so that I can place it and its entire column of data into the proper position in the table, with empty columns inserted where needed to preserve the standard sequence of columns. In other words:
A, D, C, G, H becomes A, [B], C, D, [E], [F], G, H where [ ] denote inserted empty columns
There are literally hundreds of files of data in CSV format, and I intend to process each by itself, preserving the original, and creating a new "adjusted" file for the next step. (The next step would be to concatenate all files together for numeric processing.) Original files must be preserved (chain of custody requirements), so I am opening the file, selecting records using SQL, and am now ready to do the rearranging/writing to a new file...
I didn't know if there is a "well known" SQL way of doing this (e.g., data manipulation statements of some sort), so I thought I'd see if any WLUGgers are also SQL-types. I have a few ideas for how to get this done, but they are "programmer-centric" algorithms, which tend to be different from "SQL-centric" algorithms...
The reason for SQL is that I have to do this on a MS platform because that is the standard. But, if most of the real work is done in SQL, then I can easily move to another platform in the future and only have to change the "shell code". I am using VBscript as the "shell language" because it works with all MS apps and file system objects inclusive of CSV files.
Thanks for any help!!!
Best regards, Steve ------------------------------------------------------------------------
_______________________________________________ Wlug mailing list Wlug@mail.wlug.org http://mail.wlug.org/mailman/listinfo/wlug
_______________________________________________ Wlug mailing list Wlug@mail.wlug.org http://mail.wlug.org/mailman/listinfo/wlug
Stephen Daukas wrote:
There are a lot of things I'd rather use, but I have to stick with the IT standards...
Most of my work has been in SQL Server shops where MS is the preferred technology. Having said that I would push for a Perl solution in this case because what you want to do does not lend itself to SQL easily. Perl will get the job done faster and cheaper. Not only that it is portable across platforms. You will have significantly less maintenance over the long term. -mike
I do a fair amount of data processing migrating between CSV and Oracle and mysql databases on both windows and linux and the weapon of choice on the windows side is ActiveState perl. You might want to persuade the IT people to let you use perl instead of VBscript. I can think of a couple compelling arguments right off the top of my head. 1. VBscript is crap! 2. Perl offers you amazing flexability to deal with CSV files. Either through direct manipulation using split/map/pack/unpack, to the CSV libraries, to the DBD libraries so you can treat the CSV file as a database and write SQL against it. 3. Perl offers equally great flexibility using DBI/DBD to interface and maniuplate databases, including MSSQL. 4. All your code will be immediately multiplatform. I'm sure it'll all fall on deaf eyes... just my two cents... Thanks, Tim. On 7/25/07, Stephen Daukas <scd@daukas.com> wrote:
There are a lot of things I'd rather use, but I have to stick with the IT standards...
On 7/25/07, Michael Zarozinski < MichaelZ@aiguru.com> wrote:
I think you'd be far better off using a scripting language such as perl,
it's far easier to manipulate text files with perl.
Michael Z
Stephen Daukas wrote:
Good morning!
In the continuing saga of automating various activities where I work, I have taken on processing data files. (Automating data upload is still somewhere in the air.)...
This is rather simple in concept, but a bit less obvious in practice (aren't they all!):
I would like to open up a CSV file and rearrange the columns according
to a standardized order. Not every data file will have all of the standard columns (different probes collect different data), but column headers are standardized. So, I need to inspect each column header so
that I can place it and its entire column of data into the proper position in the table, with empty columns inserted where needed to preserve the standard sequence of columns. In other words:
A, D, C, G, H becomes A, [B], C, D, [E], [F], G, H where [ ] denote inserted empty columns
There are literally hundreds of files of data in CSV format, and I intend to process each by itself, preserving the original, and creating a new "adjusted" file for the next step. (The next step would be to concatenate all files together for numeric processing.) Original files must be preserved (chain of custody requirements), so I
am opening the file, selecting records using SQL, and am now ready to do the rearranging/writing to a new file...
I didn't know if there is a "well known" SQL way of doing this ( e.g., data manipulation statements of some sort), so I thought I'd see if any WLUGgers are also SQL-types. I have a few ideas for how to get this done, but they are "programmer-centric" algorithms, which tend to
be different from "SQL-centric" algorithms...
The reason for SQL is that I have to do this on a MS platform because that is the standard. But, if most of the real work is done in SQL, then I can easily move to another platform in the future and only have to change the "shell code". I am using VBscript as the "shell language" because it works with all MS apps and file system objects inclusive of CSV files.
Thanks for any help!!!
Best regards, Steve
------------------------------------------------------------------------
_______________________________________________ Wlug mailing list Wlug@mail.wlug.org http://mail.wlug.org/mailman/listinfo/wlug
_______________________________________________ Wlug mailing list Wlug@mail.wlug.org http://mail.wlug.org/mailman/listinfo/wlug
_______________________________________________ Wlug mailing list Wlug@mail.wlug.org http://mail.wlug.org/mailman/listinfo/wlug
-- I am leary of the allegiances of any politician who refers to their constituents as "consumers".
On Wed, Jul 25, 2007 at 09:29:43AM -0400, Stephen Daukas wrote:
There are literally hundreds of files of data in CSV format, and I intend to process each by itself, preserving the original, and creating a new "adjusted" file for the next step. (The next step would be to concatenate all files together for numeric processing.) Original files must be preserved (chain of custody requirements), so I am opening the file, selecting records using SQL, and am now ready to do the rearranging/writing to a new file...
So you are selecting records out of a CSV file using SQL? How about this: select columnB,columnG,columnA,columnC,columnF from foo; That will take the columns and return them in the given order.
Chuck Anderson wrote:
On Wed, Jul 25, 2007 at 09:29:43AM -0400, Stephen Daukas wrote:
There are literally hundreds of files of data in CSV format, and I intend to process each by itself, preserving the original, and creating a new "adjusted" file for the next step. (The next step would be to concatenate all files together for numeric processing.) Original files must be preserved (chain of custody requirements), so I am opening the file, selecting records using SQL, and am now ready to do the rearranging/writing to a new file...
So you are selecting records out of a CSV file using SQL? How about this:
select columnB,columnG,columnA,columnC,columnF from foo;
That will take the columns and return them in the given order. _______________________________________________
That will work if the column order of the csv file is known. If the files come in with the columns in different orders and you have to examine the headers to determine the order or the data then it becomes more complex. If he really wants to do this in sql then you could bulk load the data into a worktable with all columns defined and char or varchar. When the data is loaded make sure you have a unique key for each row by defining one of the columns as an identity column. create table foo ( col1 identity(1,1), col2 varchar(25), col3 varchar(25) ) Once the data is imported you could right a sql script that will move the data based on the value found in the first record. if (select col2 from foo where col1 = 1) = 'myHeader' begin insert into myTable(myHeader) select col2 from foo end if (select col3 from foo where col1 = 1) = 'myHeader' begin insert into myTable(myHeader) select col3 from foo where col1 > 1 end Repeat this as necessary. Also be careful to keep the rows ordered as you move the data around. You will also need to reset the identity value between loads or come up with a method to deal with an incrementing number.
does awk compile on windows? On 7/25/07, Michael Long <mlong@datalong.com> wrote:
Chuck Anderson wrote:
On Wed, Jul 25, 2007 at 09:29:43AM -0400, Stephen Daukas wrote:
There are literally hundreds of files of data in CSV format, and I intend to process each by itself, preserving the original, and creating a new "adjusted" file for the next step. (The next step would be to concatenate all files together for numeric processing.) Original files must be preserved (chain of custody requirements), so I am opening the file, selecting records using SQL, and am now ready to do the rearranging/writing to a new file...
So you are selecting records out of a CSV file using SQL? How about this:
select columnB,columnG,columnA,columnC,columnF from foo;
That will take the columns and return them in the given order. _______________________________________________
That will work if the column order of the csv file is known. If the files come in with the columns in different orders and you have to examine the headers to determine the order or the data then it becomes more complex.
If he really wants to do this in sql then you could bulk load the data into a worktable with all columns defined and char or varchar. When the data is loaded make sure you have a unique key for each row by defining one of the columns as an identity column.
create table foo ( col1 identity(1,1), col2 varchar(25), col3 varchar(25) )
Once the data is imported you could right a sql script that will move the data based on the value found in the first record.
if (select col2 from foo where col1 = 1) = 'myHeader' begin insert into myTable(myHeader) select col2 from foo end
if (select col3 from foo where col1 = 1) = 'myHeader' begin insert into myTable(myHeader) select col3 from foo where col1 > 1 end
Repeat this as necessary. Also be careful to keep the rows ordered as you move the data around. You will also need to reset the identity value between loads or come up with a method to deal with an incrementing number.
_______________________________________________ Wlug mailing list Wlug@mail.wlug.org http://mail.wlug.org/mailman/listinfo/wlug
There is (was) a commercial package that had awk, sed, grep, et al. and when I used it (pre XP) it did a fine job. I can't remember the name off hand, but I think it originated in Israel. Since then, I have occasionally come across a package or two, but I never took a closer look since I get this automatically on Linux... Given the "rules" regarding standardization where I work, it wouldn't be an option for me... If it smells like a PC and looks like a PC, then the IT division owns it to the point where you have to get permission to change your desktop configuration! I can't even auto-update as I do not have administrator permissions... Steve On 7/25/07, Eric Martin <freak4uxxx@gmail.com> wrote:
does awk compile on windows?
On 7/25/07, Michael Long <mlong@datalong.com> wrote:
Chuck Anderson wrote:
On Wed, Jul 25, 2007 at 09:29:43AM -0400, Stephen Daukas wrote:
There are literally hundreds of files of data in CSV format, and I intend to process each by itself, preserving the original, and creating a new "adjusted" file for the next step. (The next step would be to concatenate all files together for numeric processing.) Original files must be preserved (chain of custody requirements), so I am opening the file, selecting records using SQL, and am now ready to do the rearranging/writing to a new file...
So you are selecting records out of a CSV file using SQL? How about this:
select columnB,columnG,columnA,columnC,columnF from foo;
That will take the columns and return them in the given order. _______________________________________________
That will work if the column order of the csv file is known. If the files come in with the columns in different orders and you have to examine the headers to determine the order or the data then it becomes more complex.
If he really wants to do this in sql then you could bulk load the data into a worktable with all columns defined and char or varchar. When the data is loaded make sure you have a unique key for each row by defining one of the columns as an identity column.
create table foo ( col1 identity(1,1), col2 varchar(25), col3 varchar(25) )
Once the data is imported you could right a sql script that will move the data based on the value found in the first record.
if (select col2 from foo where col1 = 1) = 'myHeader' begin insert into myTable(myHeader) select col2 from foo end
if (select col3 from foo where col1 = 1) = 'myHeader' begin insert into myTable(myHeader) select col3 from foo where col1 > 1 end
Repeat this as necessary. Also be careful to keep the rows ordered as you move the data around. You will also need to reset the identity value between loads or come up with a method to deal with an incrementing number.
_______________________________________________ Wlug mailing list Wlug@mail.wlug.org http://mail.wlug.org/mailman/listinfo/wlug
_______________________________________________ Wlug mailing list Wlug@mail.wlug.org http://mail.wlug.org/mailman/listinfo/wlug
-- Stephen C. Daukas Engineer, Geoscientist, Environmental Analyst ________________________________________________ - Please, before printing, think about the environment -
Hi Chuck! If I knew the order of the columns a priori, that would work, but sadly I don't. The number of columns, as well as the order, are random due to the fact that different sondes collect different data. In fact, my plan was to first determine the order of the columns in the source file, then I thought about using ALTER TABLE (in combination with SELECT column-of-interest) to put each column, one-by-one, into the new table in the proper order - last to first. If I select columns as you suggest, I still have to insert "empty" columns to maintain the proper column order, and I haven't found a way to insert a column into the middle of a table. (ALTER TABLE always creates a new column at the front of the table.) I was also thinking of biting the bullet and learning about MS JET (ODBC provider) attributes/field names as a way of discovering what fields (columns) are available, but I'm trying to keep this as close to "token handling" as possible so I can easily cleave MS-specific file stuff from the meat of the program. Correct me if I'm wrong on this - my SQL experience is on the order of 15 years old and I only just taught myself VB scrpt a few weeks ago... Steve On 7/25/07, Chuck Anderson <cra@wpi.edu> wrote:
On Wed, Jul 25, 2007 at 09:29:43AM -0400, Stephen Daukas wrote:
There are literally hundreds of files of data in CSV format, and I intend to process each by itself, preserving the original, and creating a new "adjusted" file for the next step. (The next step would be to concatenate all files together for numeric processing.) Original files must be preserved (chain of custody requirements), so I am opening the file, selecting records using SQL, and am now ready to do the rearranging/writing to a new file...
So you are selecting records out of a CSV file using SQL? How about this:
select columnB,columnG,columnA,columnC,columnF from foo;
That will take the columns and return them in the given order. _______________________________________________ Wlug mailing list Wlug@mail.wlug.org http://mail.wlug.org/mailman/listinfo/wlug
I received an incredibly useful suggestion from a fellow WLUGger that I am coming up to speed on. I'll fill you in when I'm confident I "get" what he is trying to "learn me". Its probably "SQL 101" for database types, but I'll provide a summary of what I end up with just the same. Definitely a SQL solution! Steve On 7/25/07, Stephen Daukas <scd@daukas.com> wrote:
Hi Chuck!
If I knew the order of the columns a priori, that would work, but sadly I don't. The number of columns, as well as the order, are random due to the fact that different sondes collect different data. In fact, my plan was to first determine the order of the columns in the source file, then I thought about using ALTER TABLE (in combination with SELECT column-of-interest) to put each column, one-by-one, into the new table in the proper order - last to first. If I select columns as you suggest, I still have to insert "empty" columns to maintain the proper column order, and I haven't found a way to insert a column into the middle of a table. (ALTER TABLE always creates a new column at the front of the table.)
I was also thinking of biting the bullet and learning about MS JET (ODBC provider) attributes/field names as a way of discovering what fields (columns) are available, but I'm trying to keep this as close to "token handling" as possible so I can easily cleave MS-specific file stuff from the meat of the program.
Correct me if I'm wrong on this - my SQL experience is on the order of 15 years old and I only just taught myself VB scrpt a few weeks ago...
Steve
On 7/25/07, Chuck Anderson <cra@wpi.edu> wrote:
On Wed, Jul 25, 2007 at 09:29:43AM -0400, Stephen Daukas wrote:
There are literally hundreds of files of data in CSV format, and I intend to process each by itself, preserving the original, and creating a new "adjusted" file for the next step. (The next step would be to concatenate all files together for numeric processing.) Original files must be preserved (chain of custody requirements), so I am opening the file, selecting records using SQL, and am now ready to do the rearranging/writing to a new file...
So you are selecting records out of a CSV file using SQL? How about this:
select columnB,columnG,columnA,columnC,columnF from foo;
That will take the columns and return them in the given order. _______________________________________________ Wlug mailing list Wlug@mail.wlug.org http://mail.wlug.org/mailman/listinfo/wlug
Stephen Daukas wrote:
Good morning!
I didn't know if there is a "well known" SQL way of doing this (e.g., data manipulation statements of some sort), so I thought I'd see if any WLUGgers are also SQL-types. I have a few ideas for how to get this done, but they are "programmer-centric" algorithms, which tend to be different from "SQL-centric" algorithms...
Not likely although one responder offered an idea. It seems to me you will have to do some text processing to bring the data into the alignment you want. Either this or force the application generating the data to maintain your structure (not a good choice). My preference to your type of issue has been to use awk (or possibly flex in combination with bison but this might be overkill). There are certainly tons of other tools available. You will be forced into the wonderful world of unix. /m
participants (7)
-
Chuck Anderson
-
Eric Martin
-
Mark Richards
-
Michael Long
-
Michael Zarozinski
-
Stephen Daukas
-
Tim Keller