Database Scanner

Introduction

The Database scanner can access SQL compliant databases and extract information via user specified SQL SELECT statements.

The Database scanner can extract BLOB or CLOB content from a database or use the content files on disk, get their path via the query and use it during the transformation process via the mc_content_location system attribute.

JDBC driver requirements

The database scanner accesses databases via a JDBC driver.

For scanning Oracle and Postgres databases the JDBC drivers are already included in the Jobserver.

For scaning any other type of Database the specific JDBC driver for that database needs to be installed/downloaded and then configured for the Job Server.

To configure a JDBC driver all required Java jar files and Java library paths need to be added to the ... \migration-center Server Components \jdbc.conf file.

The file contains all necessary information for the configurations in the comments.

Exporting objects from a database

Since many content management systems rely on some type of SQL compliant database to manage their data, the Database scanner can also be used as a generic interface for extracting information from unsupported/obsolete/custom built content management systems. The types of database management systems supported are not limited to any particular vendor or brand, as access happens via JDBC, hence any SQL compliant database having a (compatible) JDBC adapter available can be accessed and queried.

Some common content management system features supported by migration-center Database scanner are metadata, including system metadata such as permission information, owner, creator, content path, etc, as well as version information, as long as these types of information can be obtained from the respective system’s SQL database. The information extracted by the database scanner is stored in the migration-center database and can be processed, transformed, validated and imported just like any other type of scanned information.

Note that the Database Scanner can extract content from a database stored in BLOB/CLOB fields. Alternatively, the content files corresponding to the objects can be specified by the user during the transformation process via the mc_content_location system attribute.

Depending on the way the content is stored, it may be necessary to extract the content to the filesystem first by other means before migration-center can process it. For the mc_content_location system attribute any of the available transformation functions can be used, so it is easy to generate a value resembling a path pointing to the location of the object’s content file, which a migration-center importer can use to import the content. A good practice would be to export content files to the filesystem using the object’s unique identifier as the filename, and then build the path information based on the known path and the objects unique identifier. This location would need to be accessible to the Job Server running the import which will migrate the content to the new target system.

Scanner Configuration

To create a new database Scanner job, specify the respective adapter type in the Scanner Properties window – from the list of available connectors, “Database” must be selected. Once the adapter type has been selected, the Parameters list will be populated with the parameters specific to the selected adapter type, in this case the Database Scanner.

The Properties window of a scanner can be accessed by double-clicking a scanner in the list or selecting the Properties button or entry from the toolbar or context menu.

Scanner Parameters

The common adaptor parameters are described in Common Parameters.

The configuration parameters available for the Database Scanner are described below:

  • connectionURL*

    The database connection URL that is a string that your DBMS JDBC driver uses to connect to a database. It can contain information such as where to search for the database, the name of the database to connect to, and configuration properties. The exact syntax of a database connection URL is specified by your DBMS.

    Example connection strings for some common databases:

    • jdbc:oracle:thin:@[host][:port]:SID

    • jdbc:sqlserver://[serverName[\instanceName][:portNumber]][;property=value[;property=value]]

    • jdbc:mysql://host_name:port/dbname

  • driverClass*

    The JDBC driver entry class that is the class the implement the interface java.sql.Driver.

    Examples:

    • oracle.jdbc.OracleDriver

    • com.microsoft.sqlserver.jdbc.SQLServerDriver

    • sun.jdbc.odbc.JdbcOdbcDriver

  • username*

    Database username used for jdbc connection.

  • password*

    Password used for jdbc connection.

  • queryFile*

    The xml file path that contains the SQL queries that will be used by scanner to extract objects and metadata from database.

    See the Queries Configuration for more details about configuring queries.

  • scanUpdates*

    Enables the scanner to update previously scanned objects in the mc database. If unticked, previous scanned objects will be skipped. See Delta migration.

  • deltaFields

    Contains the fields that will be used for detecting if an object needs to be scanned as an update. Taken in consideration only when “scanUpdates” is checked. See Delta Migration.

  • computeChecksum

    When it's checked the checksum of scanned files will be computed. Useful for determining whether files with different names and from different locations have in fact the same content.

    Do not enable this option unless necessary, since the performance impact is significant due to the scanner having to read the full content for each and compute the checksum for it.

  • hashAlgorithm

    Specifies the algorithm that will be used to compute the checksum of the scanned objects.

    Possible values are "MD2", "MD5", "SHA-1", "SHA-224", "SHA-256", "SHA-384" and "SHA-512". Default value is MD5.

  • hashEncoding

    Specifies the encoding that will be used to compute the Checksum of the scanned objects.

    Possible values are "HEX", "Base32" and "Base64". Default value is HEX.

  • exportLocation

    The location where the exported object content should be saved. It can be a job server local folder or a shared folder and it should exist and it should be writable.

  • loggingLevel*

    See Common Parameters.

Parameters marked with an asterisk (*) are mandatory.

Query File Configuration

Any queries you want the Database Scanner to run need to be set in an XML file and the path to it set in the queryFile parameter in the scanner.

The database scanner will need at least one main query for which it may chain different optional queries deppending on the desired mode of operation.

The scanner has two main modes of operating: scanning unversioned objects or versioned objects.

Unversioned Objects

There are two ways of extracting metadata for database objects that don't have versions:

Single "main" query

<queries>
    <query type="main" key="uid">
        select uid, name, title, subject from docs.my_table where folder like '/test%'
    </query>
</queries>

Using just a "main" query that contains the unique identifier for the objects and all the columns containing the object's metadata.

The column that represents the unique ID must be stated explicitly in the select statement and also specified in the key element of the query. This value must be unique across all objects you intend to scan with migration-center. I will be used as the id_in_source_system.

"main" and "main-metadata" queries

<queries>
    <query type="main" key="uid">
        select uid from docs.my_table where folder like '/test%'
    </query>
    <query type="main-metadata">
        select name, title, subject from docs.my_table where uid = ?
    </query>
    <query type="main-metadata">
        select extra_metadata, special_info from docs.other_table where uid = ?
    </query>
</queries>

Using a "main" query paired with at least one "main-metadata" query:

The main query returns only the unique identifier for the objects. The main-metadata queries are run for each row returned by the "main" query. And the ? symbol will be replaced with the value of the key column specified in the main query.

If the main query has other columns in the select other than the ID, the main-metadata queries are ignored.

Versioned Objects

To scan versioned objects from a database you need two types of information from the objects:

  1. A version identifier that has the same value for all objects that are part of the same version tree.

  2. A way of speficying the order of the versions. This can be done either by ordering the results of the query based version number for example, or by using the parentid element to point to the previous version of the current one, with the first version having null for the parentid.

Example 1:
<queries>
    <query type="main" key="uid" versionid="version_identifier">
        select uid, version_identifier from docs.my_table where folder like '/test%'
    </query>
    <query type="versions" key="uid">
        select uid, version_level from docs.my_table
        where version_identifier = ?
        order by version_level asc
    </query>
    <query type="version-metadata">
        select name, title, subject from docs.my_table where uid = ?
    </query>
</queries>
Example 2:
<queries>
    <query type="main" key="uid" versionid="version_identifier">
        select uid, version_identifier from docs.my_table where folder like '/test%'
    </query>
    <query type="versions" key="uid" parentid="previous_ver">
        select uid, previous_ver from docs.my_table
        where version_identifier = ?
    </query>
    <query type="version-metadata">
        select name, title, subject from docs.my_table where uid = ?
    </query>
</queries>

In this case the purpose of the main query is just to provide the versionid value for the versions query. Afterwards, the versions query is ran for each versionid, and creates the objects for each version tree. If no parentid is provided, the versions are taken in the order the SQL query returns them.

You still need to specify a key column in the main query but it is not used from this query. The key value from the versions query will be used instead.

If you use a versions query, you need to use the version-metadata queries for fetching the metadata.

ContentPath

If your database references objects that already exist on a filesystem outside the database, you can add an element named contentpath to your main-metadata or version-metadata queries to specify which column contains this information.

<queries>
    <query type="main" key="uid">
        select uid from docs.my_table where folder like '/test%'
    </query>
    <query type="main-metadata" contentpath="file_location">
        select name, title, subject, file_location from docs.my_table where uid = ?
    </query>

The content will not be exported by the scanner in the location specified by this attribute. The value should point to an existing file.

BLOB / CLOB content

If your database contains content stored in BLOB or CLOB columns you can extract this content by adding a main-content or version-content query to your query file (deppending on wether you're scanning versions or not).

Main query example:
<query type="main-content">
    select [nameOfBlobColumn] as BLOB_CONTENT, [nameOfFile] as FILE_NAME, [nameOfExtenstion] as FILE_EXTENSION from MYTABEL where ID = ?
</query>
<query type="main-content">
    select [nameOfClobColumn] as CLOB_CONTENT, [nameOfFile] as FILE_NAME, [nameOfExtenstion] as FILE_EXTENSION from MYTABEL where ID = ?
</query>
Version query example:
<query type="version-content">
    select [nameOfBlobColumn] as BLOB_CONTENT, [nameOfFile] as FILE_NAME, [nameOfExtenstion] as FILE_EXTENSION from MYTABEL where ID = ?
</query>
<query type="version-content">
    select [nameOfClobColumn] as CLOB_CONTENT, [nameOfFile] as FILE_NAME, [nameOfExtenstion] as FILE_EXTENSION from MYTABEL where ID = ?
</query>

The column for BLOB/CLOB is mandatory and it must have the alias BLOB_CONTENT or CLOB_CONTENT for the scanner to process the value correctly and extract the content.

The columns [nameOfFile] and [nameOfExtenstion] are optional.

If they are set, the scanner will use the values to build the name of the file where the content will be exported.

The scanner avoids overwriting exiting files by adapting the file name to be unique in the export folder. Characters that are not allowed in the file name will be replaced.

Filename and extensions are also saved as source attributes so they can be used in the transformation engine.

For avoiding name conflicts with the other attributes, you can set aliases for these columns.

If the columns are not set, the scanner will use the value of the id in source system as a filename.

Multiple contents for a single object are allowed. If the query returns multiple rows the content returned by every row is exported. All the paths will be stored in the source attribute “mc_content_location” and the path from the last returned row is set as primary content in the column “content_location”.

Delta migration

If the parameter “scanUpdates” is not checked, the scanner will only scan new objects or new versions of existing objects and it will ignore all objects that were previously scanned, i.e. that already exist in the MC database (based on their ID in source system).

If the parameter “scanUpdates” is checked, the scanner will scan new objects or new versions of existing objects and it will detect if it should scan existing objects as updates. If an object will be scanned as an update depends on several factors:

  • If there are no attributes specified in the “deltaFields” parameter, the scanner will scan every object that already exists in the MC database as an update.

  • If one or multiple attributes are specified in the “deltaFields” parameter, the scanner will scan an object that was previously scanned as an update only if a value of a delta field in the source database is different than the corresponding value in the MC database. If all values (of all fields defined in “deltaFields”) in the source database match the values in the MC database, then the object will not be scanned as an update, it will just be ignored.

The field names in the “deltaFields” are case sensitive so you should define them as they are scanned by the scanner.

Query File Validation Schema

The scanner will validate the queries configuration file against the following xml schema:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="queries">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" name="query">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="xs:string">
                <xs:attribute name="type" type="QueryType" use="required" />
                <xs:attribute name="name" type="xs:string" use="optional" />
                <xs:attribute name="key" type="xs:string" use="optional" />
                <xs:attribute name="versionid" type="xs:string" use="optional" />
                <xs:attribute name="parentid" type="xs:string" use="optional" />
                <xs:attribute name="contentpath" type="xs:string" use="optional" />
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>          
          
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:simpleType name="QueryType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="main"/>
      <xs:enumeration value="versions"/>
      <xs:enumeration value="main-metadata"/>
      <xs:enumeration value="version-metadata"/>
      <xs:enumeration value="main-content"/>
      <xs:enumeration value="version-content"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

Annex: query elements

The xml file is composed for <query> elements that contain the queries that will be run by scanner for extracting objects and metadata. The <query> element may have the following attributes (some of them mandatory):

type

Defines the type of the query. The following values are possible: main, versions, main-metadata, version-metadata.

  • main – is the query the returns the unique identifier of every object the will be scanned. In case there is no main-metadata query, the main query may contains any number of columns which values will be added as metadata. In this case the query element may contain the contentpath attribute that will indicate the column where the object content location is stored. If a main-metadata query is specified, only the key column in the main query will be returned, all the other columns being ignored. A valid configuration must contain a single main query definition. If you need to deal with versions, this should also return the unique identifier of every version that will be passed to versions query.

  • versions – is the query the returns the unique identifier and optionally metadata of every version for the objects returned by the main query. It will take as parameter the value of column specified in attribute versionid defined in the main query. This query will be run once for every row returns by the main query. This also may contain the contentpath attribute that will indicate the column where the version content location is stored.

  • main-metadata - are the queries that extracts the metadata for main objects. They take as parameter the value return by the main query in the column specified by attribute key. These queries will be run once for every row returns by the main query. You can define an unlimited number of such queries.

  • version-metadata - are the queries that extracts the metadata for versions. They take as parameter the value return by the versions query in the column specified by attribute key. These queries will be run once for every row returns by the versions query. You can define an unlimited number of such queries.

  • main-content – is the query that extracts the content for main objects. It takes as parameter the value return by the main query in the column specified by attribute key. The query will be run once for every row returned by the main query.

  • version-content - is the query that extracts the content for version objects. It takes as parameter the value return by the versions query in the column specified by attribute key. The query will be run once for every row returned by the versions query. Note: If versions query is present, only the objects returned by this query will be extracted. The main query will be used only for identifying the version trees that will be scanned. In this case the main-metadata queries will be ignored.

name

Defines the name of the query and it will be used for logging purpose.

key

Mandatory for main and versions queries. Defines the column name which value will be stored in MC column id_in_source_system. The value of the column defined in this attribute will be passed as parameter in main-metadata or version-metadata.

versionid

Defines the column name which value will be passed as parameter to the versions query. It can be defined only for main query.

parentid

Defines the column name that contains the id of the parent version. This might be used only in case of branches. When not used, the versions will be scanned in the order they are returned by the query. Optional for versions query.

contentpath

Defines the column name that contains the path where object content is stored. It will be used to populate the MC column "content_location".

Last updated