Convert Open ICEcat XML feeds into MS SQL database tables.

Open Icecat is an a worldwide open content catalogue for product information. Started in December 2005 with 20 It companies participating, in January 2011 it counted 250+ manufacturers. This datasource of free product information is particulary favoured by other free open source platforms like WordPress and a host of free e-commerce environments running PHP.

High time to expose .NET envornments to this rich data source for consumer electronics product information. Having just 2 hours time to design, develop, code and test it, I needed an effective and efficient way to process XML directly.

XML to MS SQL database challenge

  • Convert XML datafeed (currently just under 300megs of data into a dataset.
  • Parse dataset for table structure(s).
  • If table(s) not in MS SQL DB then create table and structure.
  • Process each table row and import into the MS SQL database table.

Microsoft SMO (Server Management Objects) to the rescue.

Using SMO you can program all aspects of managing a Microsoft SQL server. Allowing the creation and generation of dynamic databases and tables. Just what was needed to create the tables in the XML on the SQL server. The second challenge was to populate the new datatables with the data in the XML dataset, here bulkcopy came to the rescue. Below is the VB.Net version 0.1 of the process that can turn any complex XML data structure into an instant MS SQL database.

Imports System.Xml

Imports System.Globalization

Imports Microsoft.SqlServer.Management.Smo

Imports Microsoft.SqlServer.Server

Imports Microsoft.SqlServer.Management.Common

Imports Microsoft.SqlServer

Imports System.Data.SqlClient

Imports Microsoft.VisualBasic

Imports System.IO

Imports System

Imports System.Xml.XPath

Imports System.Net

 

Public Class ICEcatProc

 

    '// As example

    '  MyURL = "https://data.icecat.biz/export/freexml.int/INT/files.index.xml"

 

    Public Shared Function ProcessXMLFeedURL(MyURL As String) As Boolean

        Dim OK As Boolean = False

        Try

            Dim rssReq As WebRequest = WebRequest.Create(MyURL)

            ' //Create a Proxy

            Dim px As WebProxy = New WebProxy(MyURL, True)

            '//Assign the proxy to the WebRequest

            rssReq.Proxy = px

            '//Set the timeout in Seconds for the WebRequest (increase for slow connections)

            rssReq.Timeout = 60000

            '// Setup autheentication (register for free at ICEcat to get a username and password needed to get access to the data)

            Dim username As String = "ICEcatUserName"

            Dim password As String = "ICEcatPassword"

            Dim encoded As String = System.Convert.ToBase64String(System.Text.Encoding.UTF8.GetBytes(username + ":" + password))

            rssReq.Headers.Add("Authorization", "Basic " + encoded)

            '//Get the WebResponse

            Dim rep As WebResponse = rssReq.GetResponse()

 

            '//Read the Response in a XMLTextReader

            Dim xtr As XmlTextReader = New XmlTextReader(rep.GetResponseStream())

 

            '// Set up the connection to the SQL server

            Dim MyConnectionString As String = "Data Source=......................................................"

            Dim Connection As SqlConnection = New SqlConnection(MyConnectionString)

            Dim MyServer As Server = New Server(New ServerConnection(Connection))

            Dim db As Database = New Database(MyServer, "openicecat")

            db.Create()

 

            '//Create a new DataSet

            Dim ds As DataSet = New DataSet()

            '//Read the Response into the DataSet

            ds.ReadXml(xtr)

            '// Parse tables

            For i As Integer = 0 To ds.Tables.Count - 1

                Dim Mytable As Table

                Dim MyTableName As String = ds.Tables(i).TableName

                If Not HaveTable(MyConnectionString, MyTableName) Then

                    '// Create the table

                    Try

                        Mytable = New Table(db, MyTableName)

                    Catch ex As Exception

                        Dim ii As Integer = 0

                    End Try

 

                    '// Create the columns

                    Dim Mycolumn As Column = New Column()

                    For Each dc As DataColumn In ds.Tables(i).Columns

                        Mycolumn = New Column(Mytable, dc.ColumnName)

                        Mycolumn.DataType = getdatatype(dc.DataType.ToString)

                        Mytable.Columns.Add(Mycolumn)

                    Next

                    Mytable.Create()

 

                    Dim PrimaryKeys() As DataColumn = ds.Tables(i).PrimaryKey

                    Dim PrimaryKey As DataColumn

                    For Each PrimaryKey In PrimaryKeys

                        Dim Myindex As Index = New Index(Mytable, PrimaryKey.ColumnName)

                        Myindex.IndexKeyType = IndexKeyType.DriPrimaryKey

                        Myindex.IndexedColumns.Add(New IndexedColumn(Myindex, PrimaryKey.ColumnName))

                        Mytable.Indexes.Add(Myindex)

                    Next

                End If

 

                Using MyConnection As SqlConnection = New SqlConnection(MyConnectionString)

                    MyConnection.Open()

                    Using bulkcopy As SqlBulkCopy = New SqlBulkCopy(MyConnection)

                        bulkcopy.DestinationTableName = "[" & MyTableName & "]"

                        Try

                            bulkcopy.WriteToServer(ds.Tables(i))

                        Catch ex As Exception

                            Dim iw As Integer = 0

                        End Try

                    End Using

                    MyConnection.Close()

                End Using

            Next

        Catch ex As Exception

            Throw ex '// Do errorhanddling here

        End Try

        Return OK

    End Function

 

    Shared Function getdatatype(Mydatatype As String) As DataType

        Dim dty As DataType = Nothing

        Select Case Mydatatype

            Case Is = "System.Decimal"

                dty = DataType.Decimal(2, 18)

            Case Is = "System.String"

                dty = DataType.VarChar(500)

            Case Is = "System.Int32"

                dty = DataType.Int

        End Select

        Return dty

    End Function

 

    Shared Function HaveTable(MyConnectionString As String, TableName As String) As Boolean

        Dim OK As Boolean = False

        Try

            Dim dbConn As New SqlConnection(MyConnectionString)

            dbConn.Open()

            Dim restrictions(3) As String

            restrictions(2) = TableName

            Dim dbTbl As DataTable = dbConn.GetSchema("Tables", restrictions)

            If dbTbl.Rows.Count > 0 Then

                OK = True

            End If

            dbTbl.Dispose()

            dbConn.Close()

            dbConn.Dispose()

        Catch ex As Exception

            Dim ss As Integer = 0

        End Try

        Return OK

    End Function

 

End Class

 

Conclusion

This routine processed more than a million data records in seven data tables in under a minute. The above routine can be used to turn any structured XML data feed (even RSS feeds) into a Microsoft SQL database.

Improvements: Read the schema from the XML and the table relations as well as foreign keys.

From here you can easily reference products from your own local copy of the Open ICEcat catalog. Do make sure to keep your database up to date as daily updates are provided by Open ICEcat.

More information about this open product information catalog can be found at http://icecat.nl/