Friday, December 11, 2020

World of Analytics - Trends during and post COVID-19

As we evolve, whether as technologists, leaders, or humans, we learn from every interaction in our lives. How we respond to those events is purely our choice. COVID-19 has changed many lives in many ways, but it also showed us, within our lifetimes, that we humans have survived every situation, be it a pandemic or a natural calamity. What we learn from this is what adds to our wisdom pool.

How does the pandemic change the World of Analytics?
It all ties back to survival of the fittest, the phrase popularized by Charles Darwin.
During the pandemic, vaccine companies conducted more than 500 clinical trials in search of potential treatments. They collected and curated data from trial registries and other sources to predict the spread of the disease and to inform treatments and clinical management of the pandemic.

What did they use apart from the chemical drug manufacturing process? Data and analytics.
How does this emergency correlate to other companies and industries?
Now let's talk about how data and analytics, combined with AI, will be the next generation of weaponry to predict, prepare, and proactively lead the race through these unprecedented market shifts. Now more than ever, analytics leaders need to roll up their sleeves, increase the pace and scale of their analysis, process data more quickly, and innovate and forge new ways to stay ahead in a competitive corporate world.

A few trends that are surfacing, or will surface, in the field of analytics:
  1. Every company is warming up to grab a bigger piece of the pie in its market, so data becomes crucial, and so does the toolbox used to sculpt the product. Skilled resources and beating the market become the key. AI techniques such as machine learning and natural language processing (NLP) will give an edge by providing insights that let organizations act proactively rather than reactively. Chip architectures are also evolving, from traditional CPUs to neuromorphic hardware for edge devices, accelerating computation, reducing reliance on the hub-and-spoke architecture of centralized systems, and cutting latency and bandwidth load.
  2. Gone are the days of static dashboard reporting; automated, dynamic storytelling will be the trend. In-context data stories (the most relevant insights) will be broadcast to users based on their mapped roles. Data leaders will need to regularly evaluate their BI tools and evolve their stack to meet the growing demands of changing business requirements.
  3. A shift from business intelligence to decision intelligence: companies need to shift the paradigm and prepare their mindsets for decision intelligence, which emphasizes decision modeling and decision management built on agile, complex adaptive computation systems. It combines multi-disciplinary methods, and analytics leaders will need to get their hands dirty. This trend has already had its seeds planted, and the sprouting has begun.
  4. We are stepping into the world of X-analytics, a Gartner-coined term that encompasses all kinds and ranges of data. Thought leaders should adopt the X-analytics mentality NOW to prepare for future analytics problems. The pandemic is a great example of why preparedness is the key to survival, just like an earthquake kit in California or a storm room in Ohio. Survival is the key, but if it is to endure, X-analytics is the new norm to adopt, along with augmented machinery that uses your own operational data to consolidate the architecture with automation.
  5. The buzzword of cloud should not be the shiny tool anymore. In a world of economies of scale, cloud is a given; let it not be your goal, but your means to achieve the X-factor!
  6. Just like data, tools, and technologies, the resources (process and people) need to be revamped to adapt to changing technologies. With AI taking over many tasks using deep learning techniques, humans are challenged even more to learn and keep up with the ask. There is a huge gap to close to meet the demand of the next five years.
  7. If we pause and look around, companies can be divided into four groups:
    1. sellers of data
    2. buyers of data
    3. sellers of technology
    4. consumer companies
  8. Technology, people, and process (such as governance) are embedded in all the above categories. Take blockchain, for example, which leverages data, its lineage, and the transparency of the transaction across a network topology. Companies that dive into it will make faster, easier business decisions than others. The key takeaway is that data and analytics are not all about the tools but about the thought-provoking art of the analytics leader.
  9. Let the trends of the market determine the right question you should be asking as an analytics leader, and let that business problem narrow your tool search. Not the other way around.

YogAI - your yoga guru powered by AI

YogAI – your personal digital yoga guru powered by AI. Equipment used to analyze the internal human body, and external machines for assisted living, have proven beneficial to humans. However, estimating the right curvature and posture in dynamic movements has always been a deep-rooted problem in the world of computer vision, especially when it comes to yoga. The reason is not that yoga has many variations, but that perfection in a pose is critical to get the maximum benefit. People these days prefer self-paced training, so what if there were such a system for yoga? That's where YogAI comes into play: a unique human yoga pose recognition system that self-instructs people to practice yoga correctly. It is a project based on machine learning and deep learning algorithms which uses pose estimation and key point detection methods to guide an individual into the right yoga pose. Any device of your choice becomes your yoga instructor. What if we channel this to a specific posture to guide and train you toward a healthy liver? It not only guides you into the right posture, but what if it also recommends a set of yoga postures based on your ailment and complaint? The next version of YogAI will incorporate the herbs that help enhance your liveliness along with yoga. This is where Ayurveda comes into play: it customizes herbs according to your body constitution, keeping age and other factors into consideration. The possibilities are limitless when you bring AI, ML, and deep learning together to build dynamic, real-time guidance using key point detection and pose estimation.

Monday, March 14, 2011

It's all about Journaling

JOURNALING

The literal meaning is a personal record of occurrences kept on a regular basis, like a diary of daily proceedings. So how does it apply to the database world? Journaling can be applied at the table level or at the database level. Journal tables are assigned perm space and can reside within the same database or a different one. The limitation is one journal table per database. Journaling is implemented for data protection and disaster recovery. The whole point of permanent journaling (PJ) is to capture and maintain the chronological changes made to the rows of the journaled tables. It helps protect corporate data across committed, uncommitted, or aborted transactions. Since it has the ability to capture before and after snapshots, it is a safe haven for data restore capability. Data in the PJ is only dropped when an explicit delete is issued; it withstands system restarts.

Syntax to define a Journal table:
CREATE SET TABLE databasename.J1 ,FALLBACK ,
NO BEFORE JOURNAL,
DUAL AFTER JOURNAL,
WITH JOURNAL TABLE = DATABASENAME.vasu_edw_after_jrnl_tbl ,
CHECKSUM = DEFAULT
(
col1 INTEGER NOT NULL,
col2 TIMESTAMP(6))
PRIMARY INDEX ( col1 );

So once the journal is defined, there are various situations to handle:
1. If the size grows, how do you control it?
2. How do you temporarily disable journaling?
3. How do you permanently remove it?
4. How do you handle it at the table level as opposed to the database level?

Let's discuss it case by case:

* Note: Journal table information is found in dbc.journals
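A quick, hedged way to check is to query that dictionary object directly; SELECT * is used here because the exact column layout varies by release, so verify the output against the DBC documentation for your system:

SELECT * FROM dbc.journals;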

The journal table grows and grows until you move its contents to a saved set using the "CHECKPOINT WITH SAVE" clause in the ARCMAIN script. This moves the data from the active journal to the saved set. You can then back up the saved set and delete it to regain the space. Anyway, let me give you the details of how this works:

STEP 1:
We defined the journal table in the DDL for table J1; the journal table name is vasu_edw_after_jrnl_tbl. This table grows bigger and bigger.
STEP 2:
We need to reduce its size by moving the journal rows to the saved set; the commands and syntax below are issued from the arcmain utility. First back up the data tables in the database, then back up the journal table with the "checkpoint with save" option. Once the rows are moved to the saved set, you can delete the saved set and thus reduce the size of the database.
a) logon tdpid/backupuser,password;
archive data tables
(database.tablename) ,
release lock,
use group read lock,
file = testtbls;
logoff;
b) logon tdpid/backupuser,password;
checkpoint (database.vasu_edw_after_jrnl_tbl) ,
with save,
use access lock,
named prdchkpt;
archive journal tables
(database.vasu_edw_after_jrnl_tbl) ,
release lock,
file = prdjrnl;
delete saved journal (database.vasu_edw_after_jrnl_tbl);
logoff;

STEP 3:
If your goal is TURN OFF JOURNALING temporarily both at table and database level
then either MODIFY command is used for the database or the user or ALTER statement can be used for the table level.

ALTER TABLE ., no before journal, no after journal;
for the database level
MODIFY DATABASE AS NO BEFORE JOURNAL, NO AFTER JOURNAL;

Once all the tables in the database that were earlier defined with before or after journaling have been modified using the ALTER statement, you can drop the journal table. So, assuming all the ALTER statements have been executed, issue the MODIFY DATABASE command to drop the journal table.

MODIFY DATABASE databasename AS DROP DEFAULT JOURNAL TABLE = databasename.vasu_edw_after_jrnl_tbl;
The statement above gets rid of the journal table.

Thursday, January 27, 2011

Performance Tuning 101

Introduction
Performance of any system is critical for timely business decision making. Strategic decisions depend on SQL performance, and all the intelligence derived can turn informational events into insightful information that brings value to the business. Performance tuning is so broad that I decided to categorize it and tackle it piece by piece. Tuning applies at both the application and the system level. We all start our application tuning thinking that the system is all set and ready to deliver, but most of the time that is not true, as system behavior differs from site to site and business to business. So I will not write this blog on the premise that system tuning is already in place; I'll start from the building blocks of system tuning and then head to application tuning.

Tuesday, October 12, 2010

ONLINE ARCHIVAL procedure and troubleshooting.

ONLINE ARCHIVE

Online archiving is a facility that allows concurrent DML operations to run while the archive process is archiving. ONLINE ARCHIVE means that the database or tables being archived are not locked against DML operations, so the archive can be taken during real-time business operation. How it works: a logger is triggered that captures all the DML changes happening during the archive process; once the archive finishes, the logger applies all the changes, whether they are commits or rollbacks. Remember, the logger is active only while the archive is running and is no longer active once the archive job finishes its archival process.

Code for ONLINE ARCHIVE

ARCHIVE DATA TABLES (DATABASE.TABLE),
RELEASE LOCK,
ONLINE,
FILE=;

How to find which tables or databases currently have an active ONLINE archive:

select CreateTimeStamp,
DatabaseName (VARCHAR(30)),
TVMName (VARCHAR(30))
from DBC.ArchiveLoggingObjsV;

DBC.ArchiveLoggingObjsV is a view that provides information about the table(s) that have an active log for an online archive.

Notes: It is not recommended to archive view databases with the ONLINE option. I came across a bug, dated October 10th 2010, where if you archive view databases online, the archive job doesn't release the online logging active log. This blocks DDL changes such as refreshing a view even after the archival process is completed. You need to issue an explicit OFF statement, shown below:

LOGGING ONLINE ARCHIVE OFF FOR (DATABASENAME), OVERRIDE;
---Use OVERRIDE in situations where the user ID logged in to issue the above statement is different from the user ID that turned ON the online logging.
---One more note: you need to issue this statement as an ARC statement, not a SQL statement. It is not like the SQL statement you use to release HUT locks from a SQL tool.
I'm attaching a simple script and how to invoke it.

step1:
create an arc script with any name like release_online_arc_lock.arc
step2:
code for the script should be like this:
.LOGON TDPID/dbc, ;

LOGGING ONLINE ARCHIVE OFF FOR (DATABASE), OVERRIDE;

.LOGOFF;

step3:
After saving the script, open a command prompt (a DOS command prompt, assuming the BAR server is on a Windows box), type the statement below, and press Enter:
arcmain release_online_arc_lock.arc
step4: You should see the following:

C:\Documents and Settings\vkillada>arcmain release_lock.arc
10/11/2010 13:19:42 Copyright 1989-2008, Teradata Corporation.
10/11/2010 13:19:43 All Rights Reserved.
10/11/2010 13:19:43
10/11/2010 13:19:43 *** **** ****
10/11/2010 13:19:43 * * * * * PROGRAM: ARCMAIN
10/11/2010 13:19:43 ***** **** * RELEASE: 13.00.00.05
10/11/2010 13:19:43 * * * * * BUILD: 080221eW (Nov 12 2009)
10/11/2010 13:19:43 * * * * ****
10/11/2010 13:19:43
10/11/2010 13:19:43 RESTARTLOG = ARCLOG101011_131943_130196.RLG
10/11/2010 13:19:43
10/11/2010 13:19:43 PARAMETERS IN USE:
10/11/2010 13:19:43
10/11/2010 13:19:43
10/11/2010 13:19:43
10/11/2010 13:19:43 CHARACTER SET IN USE: ASCII
10/11/2010 13:19:43 .LOGON TDPID/dbc,;
10/11/2010 13:19:45 LOGGED ON 3 SESSIONS
10/11/2010 13:19:45
10/11/2010 13:19:45 DBS LANGUAGE SUPPORT MODE Standard
10/11/2010 13:19:45 DBS RELEASE 12.00.03.08
10/11/2010 13:19:45 DBS VERSION 12.00.03.08
10/11/2010 13:19:45
10/11/2010 13:19:45 STATEMENT COMPLETED
10/11/2010 13:19:45
10/11/2010 13:19:45
10/11/2010 13:19:45 LOGGING ONLINE ARCHIVE OFF FOR (DATABASENAME),
OVERRIDE;
10/11/2010 13:19:45 UTILITY EVENT NUMBER - 326813
10/11/2010 13:19:46 STATEMENT COMPLETED
10/11/2010 13:19:46
10/11/2010 13:19:46
10/11/2010 13:19:46 .LOGOFF;
10/11/2010 13:19:47 LOGGED OFF 3 SESSIONS
10/11/2010 13:19:47 STATEMENT COMPLETED
10/11/2010 13:19:47
10/11/2010 13:19:47
10/11/2010 13:19:47 ARCMAIN TERMINATED WITH SEVERITY 0

The online logging will trigger again if the online archive arcmain script is run again.
Now, if you run the query below again, you should no longer see the database whose log you could not get released before the arcmain script kicked off. The script above explicitly releases the logging lock.


select CreateTimeStamp,
DatabaseName (VARCHAR(30)),
TVMName (VARCHAR(30))
from DBC.ArchiveLoggingObjsV;

If you have any questions related to archival and logging issues, please feel free to email me at vasudev.killada@gmail.com, or leave a comment so I can update this blog with more information.












Wednesday, October 6, 2010

Rollback Priority in Teradata

Rolling back is a CPU-intensive job. It happens when a user aborts a query that has made DML changes to the database, and the database then has to revoke all those changes. This is done to maintain transaction integrity and is managed through the transient journal, also called the TJ. All incomplete transactions are stored in the transient journal; the system table is dbc.transientjournal.

Imagine you have millions to billions of rows that need to be rolled back, and the caveat is that rollbacks are executed in the RUSH performance group of the default partition when the DBS setting is FALSE, which is the default. The rollback job will compete for CPU with other users and will impact performance heavily. In addition, the rollback holds locks on the tables involved, which are then not accessible to users. You can wait until the rollback finishes or cancel the rollback job; I've written in my blog how to cancel a rollback job, refer to http://goterabytes.blogspot.com/2010/08/canceling-rollback-job.html

You can cancel the rollback to make the table available sooner. The RUSH priority is designed for the rollback's own benefit, so that rollbacks finish quickly and the table becomes available to the business. But if other users are impacted, the rollback should not run in the highest performance group of the default partition.


ROLLBACKPRIORITY SET TO TRUE:

If RollbackPriority is TRUE, rollbacks are executed within the aborted job's performance group and associated resource partition. The intent of this is to isolate the rollback processing to the job's performance group, while not affecting performance of the rest of the system. TRUE is better for smaller rollbacks to non-critical, less extensively used tables.

This setting only affects rollbacks from sessions aborted by the user and has no effect on rollbacks caused by a DBS restart. The restart rollback priority can be configured with the rcvmanager utility.
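For reference, this setting lives in the General fields of the DBS Control record and is typically changed with the dbscontrol utility from a node console. The session below is only a sketch: the field number for RollbackPriority varies by release, so use the number shown next to the field name in the DISPLAY output, and remember that a WRITE is needed to commit the change.

dbscontrol
display general
modify general <RollbackPriority field number> = true
write
quit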

Swapping in Teradata

Swapping in Teradata is bad and has a big impact on performance. We can see the difference with a few performance tweaks.
One of many that I witnessed recently involves FSGCACHE%. The more swapping there is, the bigger the impact on performance. There should be a balance between Teradata resources and OS resources: the operating system needs room to run its own processes. We certainly do not want to starve the cache in favor of disk reads, but when we swap, the whole picture changes and all the AMPs end up throttled by a single swap drive. Outside of FSGCACHE is the space the O/S needs to run.

To eliminate swapping on performance-critical boxes, it is highly recommended to look into FSGCACHE%. It is pre-configured when the system is shipped, but the default does not hold true for every business. If swapping is significantly worse, lowering the value will have a bigger impact.

Things to look at:
• Swap on system
• Skew between nodes
• LINUX run queues

How to look at swap on a LINUX system

This can be done using the sar utility in LINUX (the system activity reporter). Samples are taken every five minutes throughout the day, and the impact of swap can be read and reviewed by running the following commands:

psh sar -W
This will output the entire day's samples whenever it is run. The "psh" part runs it across all the nodes. Do not expect to see anything on the Hot Standby (HSN) node, since Teradata is not actively running on the HSN.

Options of the sar command that are frequently used:

sar -u
%nice - this takes the place of %user that we used to use in MP-RAS
%sys - this is the time spent on system calls
%IO wait - This is all I/O on the system (disk, bynet, etc.).
%cpu idle - This is a good indication of CPU utilization
I like to see %nice to %sys at roughly 60-70% / 30-40%.
%IO wait can be highly variable since it records all I/O; I don't use it much.
%cpu idle can certainly run down to 0%, but if it stays there for extended periods it probably means you are near capacity.

sar -q
runq-sz - the number of runnable Linux processes waiting to be serviced. This is a good indicator of application activity. Teradata primarily runs AWT processes; since you have 63 usable AWTs, processes wait once the run queue exceeds that. It is not bad to wait sometimes, and this queue gets drained and refilled at a high rate. When I see run queues of 100+ for sustained periods of time, that indicates a high degree of concurrent work. The value can also be compared between nodes to get a view of overall skew on the machine.

sar -W
pswpin and pswpout - I add these together to get an overall picture of swap. Swap in Teradata is bad; it keeps the AMPs from performing efficiently, and FSGCACHE is the knob used to tune it. Occasional bursts in the 0 - 0.5 blocks/sec range will not impair operations, but sustained activity in that range should be tuned. Ranges from 0.5 to 5 blocks/sec need tuning to eliminate them, as they will impact performance. Ranges above 5 require emergency tuning, since their impact on Teradata operation is extreme. These are averages in blocks/sec, so within a 5-minute sample interval this could mean much higher peak values for many seconds at a time, which could severely impact <10 sec SLAs for high-performance queries.

And for more information a manual page for sar can be viewed on the system itself.

Peeking facility in Teradata

What is peeking?

Peeking means checking for potential optimizations by looking at the USING values at query parsing time and using them to pick an efficient path to satisfy the request.

What is generic plan?
A generic plan is a single plan for all similar queries; it is cached and reused to save parsing time, that is, the CPU time spent parsing, generating the plan, and sending it to the Dispatcher. However, a cached plan is not always the best plan.

What is specific plan?

A plan that is generated specifically for a query and can differ for different USING values.

What determines whether the Optimizer uses a specific plan?

The DisablePeekUsing field in the DBS Control settings determines whether a specific or generic plan is used. By default it is shipped as FALSE, meaning the Optimizer performance enhancements are enabled: the Optimizer can peek at the USING values and pick an efficient path. Setting it to TRUE disables these enhancements.
When the setting is TRUE, parameterized queries are no longer eligible for specific plans; generic plans are used for all parameterized queries, which eliminates the repeated parsing overhead once peeking is turned off.
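To make the idea concrete, here is a minimal sketch of a parameterized request written with the USING request modifier; the database, table, and column names are made up. With peeking enabled (DisablePeekUsing = FALSE), the Optimizer can look at the actual value bound to :cust_id at parse time and build a specific plan; with the setting TRUE it falls back to the cached generic plan.

USING (cust_id INTEGER)
SELECT order_id, order_amt
FROM sales_db.orders
WHERE customer_id = :cust_id;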

Tuesday, August 31, 2010

Amp Worker Tasks behavior using STREAM or TPUMP operator.

TPump, now called the Stream operator under the covers of TPT (the TD12 Teradata implementation for real-time trickle feeds), performs row-level operations: based on the hashing algorithm, it applies each DML operation to the row hash. There can be application-level contention when the NUPI defined on the table causes operations to land on the same row hash ID. This can be controlled by proper configuration of the parameters and settings (SLEEP, TENACITY, SERIALIZATION, SESSIONS, etc.); a sketch is shown below. The session blocking is only momentary, but sometimes it leads to contention overhead and nuisance alerts to the DBAs. Having too many sessions is as harmful for performance as having as few as one, so careful analysis in choosing the session count will not only improve TPump throughput but also improve overall system performance.
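As a hedged illustration of those knobs, a classic TPump script declares them in its .BEGIN LOAD statement. The option names below come from the list above, but the exact clause placement, defaults, and required companion statements should be verified against the TPump (or TPT Stream operator) reference for your release, and the error-table name is made up.

.BEGIN LOAD
   SESSIONS 4
   PACK 20
   SERIALIZE ON
   TENACITY 4
   SLEEP 6
   ERRORTABLE work_db.tpump_errors;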

System level design for TPUMP operation w/ table designed as fallback

When a DML operation is performed on a fallback table, a P2P (point-to-point) protocol message is spawned from the primary AMP to the fallback AMP, which then stores the fallback copy of the row. This entire operation needs two AWTs (AMP Worker Tasks), doubling the number used by a regular non-fallback DML operation.

For example, here is how it works when a user submits an insert on a table defined with fallback:
Steps:
1. An AMP Worker Task is assigned on the AMP that holds the primary copy. Once the insert succeeds, a message is transmitted to the fallback AMP.
2. Once the fallback AMP receives the message, it tries to perform the insert on its own AMP, which needs another AWT; the AWT on the primary is not relinquished until the operation succeeds on the fallback. So two AWTs are held until the fallback operation is successful. If the fallback AMP doesn't get an AWT right away, the request waits in its queue and the primary AMP still does not relinquish its AWT. You can call it a design flaw or a data integrity safeguard, but that is how it works.

So if you are seeing delays on top of the deadlocks TPump can hit because of multiple sessions, the cause could also be AWTs; the effect is usually minimal, but on a busy day in a flow-control situation it makes the whole performance picture worse.

Recommended actions/solutions:

1. I don't recommend this, but some sites do it: you can reserve AWTs for TPump jobs. Be aware that reserving follows a 2n+1 rule, where n is the number of AWTs to be reserved, and the spawning of additional AWTs makes it worse when we reserve.
2. Use both object and workload throttles so that enough AWTs remain available and TPump jobs get good throughput. I recommend this approach; it has worked much better.

Canceling a rollback Job

A big rollback job can bring the system to its knees. If your DBS Control parameter RollbackPriority is set to FALSE, the rollback takes priority and will not honor any rules related to CPU assignment, TASM, or anything else: the rollback job runs at RUSH priority and can bring down your system. Rollbacks (like ALTER statements) will never honor your CPU limits in this mode.

I started canceling rollback jobs on TD12, but the feature has been available since the V2R5.1 release. Canceling may even force you to drop the table on which the cancel rollback was applied, so you may lose data; if you have a backup, you can restore or copy the data back from it.

So what causes a bad rollback?
A user submits a DML operation that runs for a while without returning results in the stipulated time, or simply takes too long. The user can either wait until it finishes or abort it; if they choose to abort, this causes a rollback on the table, which can take even more time to relinquish the resources.

How to avoid it?
Avoid triggering the rollback in the first place, or if one has started, try to cancel it. I suggest setting the RollbackPriority field in the General Fields of the DBS Control record to TRUE, which keeps the rollback from becoming a resource hog.
For small rollback jobs the time taken can be negligible, but for big jobs the rollback needs to finish as soon as possible so that the ETL team can continue loading, or start loading the same table again with the correct update/insert statements.

Notes:
Remember that most of the time, when you cancel a rollback job, you will want to drop the table and create a new one, as the table state is often left inconsistent.
The rcvmanager utility is used to do the job, using the root password for the system, assuming it is a LINUX box.

Steps to cancel a rollback job (a sketch of the command sequence follows the steps):
1) Log into the node using the root username and password.
2) Connect to the CNS by typing cnsterm 6 to find which window rcvmanager runs in (usually window 1), press Ctrl+C to come out of it, and then type cnsterm 1.
3) Type start rcvmanager and, once you get the command prompt, type: list rollback tables
4) From the list of rollback tables, note the table ID of the table whose rollback you need to cancel.
5) The table ID is a hex value, so make sure you copy the values both before and after the colon (:).
6) Issue the cancel rollback statement: cancel rollback on table <tableID>. It prompts for the DBC password.
7) To verify the status, issue list cancel rollback tables to get information about the rollback status; it will show as pending.
8) Once done, you can watch system performance and speed recover. At times during a rollback I have observed that the database allows no new logons; once the rollback is cancelled, all jobs and resources return to normal behavior.

I hope this article helps folks who find themselves in this situation.

Why do we need Query Banding?

By enabling the Database Query Log (DBQL) and installing the Performance Data Collection and Reporting (PDCR) database in all Teradata environments, the organization gains a plethora of log, statistical, and other audit information.
DBQL and PDCR log and store historical data about queries, including query duration, CPU consumption, and other performance metrics. The data offers the information needed to calculate suspect-query indicators such as large-table scans, skewing, and highly consumptive joins. In addition, the data provides diagnostic metrics to help tune applications, monitor performance, manage capacity, and operate the Teradata system at peak efficiency.
To fully leverage DBQL and PDCR capabilities, it is imperative for the BI group to enable query banding in the universe. This gives us a valuable tool to band each BusinessObjects query submitted to Teradata with a number of attributes and values that provide detailed information about the query. The information allows us to effectively tune BusinessObjects applications by identifying and tuning target reports and proactively preventing report performance problems.
Query banding is especially valuable for our BusinessObjects applications since we use generic accounts to access the Teradata database; query banding lets us identify the unique attributes of each request.
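Once query banding is in place, the band travels with each request into the query log and can be pulled back out. The query below is only a sketch: it assumes DBQL logging to DBC.DBQLogTbl is enabled, and the column names are quoted from memory of the DBQL layout, so verify them (and consider the PDCR history copies) for your release.

SELECT QueryBand, AMPCPUTime, TotalIOCount
FROM DBC.DBQLogTbl
WHERE QueryBand LIKE '%DOCNAME%';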

How to set Query Banding in universe parameter?

The Teradata database supports passing parameters before the SELECT statement. The BEGIN_SQL parameter allows us to prefix SQL statements with the query banding values each time a SQL statement is generated. Below are the steps to set up query banding in a universe:

• Start the Universe Designer
• Open your Universe (for existing Universes, please import them first)
• Click on File menu -> Parameters…
• Click on Parameter tab
• In the "Parameter" list, click on BEGIN_SQL parameter
If it does not exist, add one.



• Enter value for BEGIN_SQL:
Copy following name/value pairs into the Value box:

SET QUERY_BAND = 'USER = '@Variable('BOUSER')'; DOCNAME = '@Variable('DOCNAME')'; DPNAME = '@Variable('DPNAME')'; DPTYPE = '@Variable('DPTYPE')'; UNIVERSE = '@Variable('UNVNAME')'; UNIVERSEID = '@Variable('UNVID')';' FOR TRANSACTION;

Variable description:
a) BOUSER - user login
b) DOCNAME - the name of the document
c) DPNAME - the name of the data provider
d) DPTYPE - the type of the data provider
e) UNVNAME - the name of the universe
f) UNVID - the ID of the universe used

• Click Replace



• Click OK
• Save the Universe
• Export the Universe to CMS

Oracle to Teradata Migration Project tips

Introduction:
The decision to move from OLTP systems like Oracle, with multiple data marts, into a Teradata Enterprise Data Warehouse is not complicated to start with, but it needs the right skills and planning. Done well, it gives the business the edge of a 3NF physical model, linear scalability, and mixed-workload capacity in an active and challenging day-to-day business life.

Bird's-eye view agenda:
• Get the business requirements from the client, anything that is necessary for or related to the migration.
• Have an approach for designing the physical database: tables, indexes, views, and other database objects.
• Plan the migration of all business-related log and historical data.
• Plan the incremental load conversion.


Detailed technical and business artifacts for migration implementation

Client-related infrastructure:
• Business rules and triggers embedded in the data.
• The data model of the existing business and the model for the tables that need to be moved.
• All software and hardware compatible with the release of Teradata that is in place.
• Make sure the requirements are at least frozen during the port operations on the sending side (Oracle/DB2).
• If multiple data marts are being ported, do one at a time, then determine the names of the production database and the view database.
• Design a strategy for creating profiles and users.
• As part of the data movement and migration, make sure we know what gets moved and how, meaning: are we landing the data on disk first or doing a direct forklift?
• Determine the right PI during table design by talking to the ETL and front-end reporting folks.
• Design the initial load jobs with the assumption that the target tables are already created.
• Determine the type of slowly changing dimensions and design accordingly.

Approach for Physical DB design and conversion:
• The maximum length of a Teradata database object name is 30 characters, so always keep the naming convention in mind: DB2 OS/390 has a maximum name length of 18, DB2 on Unix allows 128, and Oracle is no problem since its limit is also 30 characters.
• Ask if the client has a naming standard; if so, follow it, so that a mapping sheet from the longer names down to 30 characters is created, and have the business sit with us to review the sheet and agree on valid names.
• Have all the DDL and indexes converted; for DB2:
o Use the DB2 system catalog as input to a conversion script.
o Use a data modeling tool like ERWIN and do the reverse engineering.
o Use the DB2LOOK command and an editor.
o The same goes for the indexes: follow Teradata index creation criteria, and use the column and index catalog tables unloaded from DB2.
o Converting UDFs to Teradata needs thorough testing, whether the source is DB2 or any other database, because the language and features are different. The easy way to handle this is to re-create the UDF logic as macros or stored procedures, and to map UDTs to data containers within stored procedures.


Log and Historic migration:

• Determine the migration approach: is an ETL tool going to do the pull and push, or will the source data be put into flat files and then loaded into Teradata using the load utilities? Since the target tables are empty, FastLoad can be used to do the loading (a sketch follows this list).
• The SLA constraints determine how to load the data: either go from the legacy system through the legacy ETL and then the Teradata ETL methodology to production, or use a scripting tool to generate the load and unload scripts to get the data in place. The second option is an automated process that loads and unloads at the same time (using NDM technology).
• Keep all the rules next to you, for instance the flat-file format rules for transfer to Teradata: is it delimited, fixed-width, or both?
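As a hedged sketch of the FastLoad path mentioned in the first bullet above: the script below loads a pipe-delimited flat file into an empty staging table. Every name, the delimiter, and the two-column layout are made up for illustration, and the statement order and options should be checked against the FastLoad reference for your release.

LOGON tdpid/loaduser,password;
BEGIN LOADING target_db.sales_stg
   ERRORFILES target_db.sales_err1, target_db.sales_err2;
SET RECORD VARTEXT "|";
DEFINE in_col1 (VARCHAR(20)),
       in_col2 (VARCHAR(20))
FILE = sales_extract.dat;
INSERT INTO target_db.sales_stg (col1, col2)
VALUES (:in_col1, :in_col2);
END LOADING;
LOGOFF;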


Incremental Load conversion:

• The requirement can be completely ported to Teradata before running the loads, or the data may continue to be loaded into the DB2/Oracle database during the port. There are ways to handle this:
o Either stamp the data with a date so the migration moves data from a point in time, also called incremental load synchronization, or
o Do a refresh: the data is moved at the very end of the migration process, after the source system finishes its last run and before the first run of the new system. If this is not possible for business continuity reasons, a custom solution needs to be implemented to capture the changes.

Process flow for a smooth data migration

• The migration design, with all validations, should be in place before testing anything.
• Then comes the ETL and the specs to build, followed by the migration of data.
• Integration testing comes into the picture, along with the effort of keeping parallel systems in sync if the migration is implemented that way.
• For huge migration projects, automation is recommended, such as processes and macros that report real-time status for tracking purposes.

Conclusion:

For any project, the planning phase is critical: measure a thousand times before you cut. I took some time to write down a few things to share my knowledge and experience as a lead Teradata DBA, DB architect, and GoldenGate specialist with a strong ETL architecture background, and I kept this document at a high level. I'll be more than happy if it helps fill any of the gaps in your planning and analysis phase.

Fast Path Insert and Delete Statements.

Rules for Fast Path Insert and Delete Statements

Solution/source: Knowledge article

Tech Note

Unconstrained (Fast Path) DELETE Processing
When submitted by itself as a transaction, an unconstrained delete
(DELETE...ALL) is processed very quickly.
An unconstrained DELETE is one where no conditions are attached, that is, all
rows are deleted. When an unconstrained delete is submitted as part of a
multi statement transaction, the speed with which it is processed depends on its
position in the transaction.
To enable fast path processing, which defers processing until a transaction ends
and allows rollback processing to be almost instantaneous, positioning must be
as follows, depending on the mode being used.

In Teradata Mode, positioning must be as follows, depending on the type of transaction:
• Implicit (such as a multistatement macro or BTEQ request): DELETE...ALL must be the last statement in the request.
• Explicit: DELETE...ALL must be immediately followed by the END TRANSACTION statement that terminates the currently open transaction. This implementation is not available to a Teradata embedded SQL application, because the DELETE and END TRANSACTION statements must fall into the same request packet. (A sketch follows.)
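For the explicit-transaction case, a BTEQ-style sketch (the table name is made up) looks like this; the leading semicolon packs DELETE ... ALL and the END TRANSACTION (ET) into the same multistatement request, which is what fast path requires.

BT;
/* ... other statements in the transaction ... */
DELETE FROM tgt_db.stage_table ALL
;ET;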
======================================================================
Fast Path INSERT … SELECT Statements
Multistatement INSERT … SELECTs are optimized so that each statement
except for the last in the series returns a ‘zero row inserted’ message as a
response. The retrieved rows for each SELECT are sent to an empty target table.
The last INSERT … SELECT statement returns the total number of rows
inserted for the entire request and sorts and merges the spool table into the
target table.
Columns defined with the COMPRESS option cannot participate in fast path
optimizations, so if you perform an INSERT … SELECT operation on
compressed columns, fast path optimization is not specified by the Optimizer
when it creates an access plan.

Rules for Fast Path INSERT … SELECT Statements
There are several restrictions that must be observed in order for the high
performance fast path optimization to succeed in a multistatement request.
These rules are as follows:
• The target table must be empty.
• All INSERT statements in the multistatement request must have the same
target table.
• Only INSERT statements can be included in the request.
If you insert other statement types into the multistatement request, the fast path
optimization does not occur (only the first INSERT … SELECT in the series is
optimized) and performance degrades accordingly.
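Here is a minimal BTEQ-style sketch of a fast path multistatement INSERT ... SELECT into an empty target table; the database and table names are made up. The leading semicolons make the three INSERTs one multistatement request, so only the last returns the total row count and performs the sort and merge into the target.

INSERT INTO tgt_db.sales_all SELECT * FROM src_db.sales_2019
;INSERT INTO tgt_db.sales_all SELECT * FROM src_db.sales_2020
;INSERT INTO tgt_db.sales_all SELECT * FROM src_db.sales_2021;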

Tuesday, May 18, 2010

Goldengate

What is GoldenGate?
GoldenGate enables the exchange and manipulation of data at the transaction level among multiple, heterogeneous platforms across the enterprise. Its modular architecture gives you the flexibility to extract and replicate selected data records, transactional changes, and changes to DDL (data definition language) across a variety of topologies. With this flexibility, and the filtering, transformation, and custom processing features of GoldenGate, you can support numerous business requirements:
● Business continuance and high availability.
● Initial load and database migration.
● Data integration.
● Decision support and data warehousing.

Friday, August 21, 2009

Introduction to Teradata

TERADATA came into existence with a goal and a dream in mind: businesses that want to build huge data warehouses should not have to compromise because their database cannot meet the challenges and requirements of constant change. It has the ability to respond quickly to changing conditions and is more flexible than any other RDBMS. It is designed for parallel processing and can scale linearly to accommodate terabytes of data.

In short, other database vendors designed their databases around OLTP (Online Transaction Processing) and stopped thinking further. That's where TERADATA started thinking: they came up with the idea of designing a database for OLAP (Online Analytical Processing) and DSS (Decision Support Systems).

The Teradata architecture is both simple and complex. It is simple because the user doesn't have to worry much, and complex because the RDBMS has its own algorithms and intelligence that know the easiest and fastest way to retrieve data; it also has its own way of storing the data. Teradata is based on the fundamentals of mathematical set theory. It is a database that loads data in parallel, backs up data in parallel, and processes data in parallel. It is a story of parallel processing.

In my next post I'll give an overview of the TERADATA architecture so that it will be a jump start for all the nascent users.
....continued TERADATA Architecture
Home:
Introduction and Purpose of this Website:


Today's world is the result of human evolution through thought and through adapting to the ever-dynamic changes on this overcrowded planet. It isn't limited to just that; it applies to all possible areas. It is quite amazing how things have changed over time. I won't drag you to an antique show, but every morning when you wake up you are not only thrilled and excited about the latest technology but also amazed, looking back, at how fast things are changing. The same applies to Information Technology. Imagine going from the old ways of communicating, through drums and messengers, to today's fiber optic communications. I'm sure this page will be thoroughly outdated when some new technology beats or replaces fiber optics. The way we do business has changed considerably, and business managers have to keep up the pace to meet challenges and stay competitive. As the author of this website, my goal is to share the knowledge and experience I have gained with the outside world. Since this website's goal is to teach data warehousing, business intelligence, and database concepts, I'll restrict it to purely technical concepts. You can also read my social blog at http://VasudevK.blogspot.com.


In my first few chapters I'll give an overview of data warehousing and its terminology. After that I'll start with the basics of Teradata and go into detail.




Introduction

In today's world, information about the market at the speed of a photon helps industry make strategic decisions, echoing Darwin's notion of survival of the fittest. Making the right decision at the right time helps the wheel of industry clear its hurdles. Data warehousing is industry's heart for strategic decision making.
A data warehouse is the core repository of an organization's historical data, sometimes called its community memory. It holds the data, information, and knowledge that people interact with when making decisions. For example, an organization would use the information stored in its data warehouse to find out what day of the week it sold the most widgets in May 1992, or how employee sick leave the week before Christmas differed between Mumbai and Seattle from 2002 to 2006. In other words, the data warehouse contains the raw material for management's DSS (decision support system). The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis (such as data mining) on the information without slowing down the operational systems.
Origin/History of Data Warehouse
During the late 1980s to early 1990s, data warehouses became a distinct type of computer database. They were developed to meet a growing demand for management information and analysis that could not be met by operational systems. Operational systems were unable to meet this need for a range of reasons:
The processing load of reporting reduced the response time of the operational systems,
The database designs of operational systems were not optimized for information analysis and reporting,
Most organizations had more than one operational system, so company-wide reporting could not be supported from a single system, and
Development of reports in operational systems often required writing specific computer programs which was slow and expensive
As a result, separate computer databases began to be built that were specifically designed to support management information and analysis purposes. These data warehouses were able to bring in data from a range of different data sources, such as mainframe computers, minicomputers, as well as personal computers and office automation software such as spreadsheet, and integrate this information in a single place. This capability, coupled with user-friendly reporting tools and freedom from operational impacts, has led to a growth of this type of computer system.
Advantages of using data warehouse
There are many advantages to using a data warehouse, some of them are:
Enhances end-user access to a wide variety of data.
Business decision makers can obtain various kinds of trend reports e.g. the item with the most sales in a particular area / country for the last two years.
A data warehouse can be a significant enabler of commercial business applications, most notably customer relationship management (CRM) the automation or support of customer processes involving sales or service representatives.
Concerns in using data warehouses
Extracting, cleaning and loading data is time consuming.
Data warehousing project scope must be actively managed to deliver a release of defined content and value.
Problems with compatibility with systems already in place.
Security could develop into a serious issue, especially if the data warehouse is web accessible.
Data Storage design controversy warrants careful consideration and perhaps prototyping of the data warehouse solution for each project's environment.




After reading this chapter you should be familiar with data warehousing and business intelligence.
Overview: I would like to explain this chapter in plain, simple English first and then delve into the details.
Business Intelligence (BI): You can call this a technology, an application, or a set of practices to support better decision making. It is data driven, meaning data serves as the input for making decisions, and the systems that support these decisions are called decision support systems (DSS). When you combine the technology and applications for efficiently reading this data to better understand its behavior and trends, we call it business intelligence.
A more technical definition of BI: business intelligence (BI) is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. It simplifies the discovery and analysis of information, enabling every decision maker in an organization to easily access, understand, analyze, collaborate on, and quickly act on it for better, timelier decisions.
BI is about extracting information from multiple sources of data, synthesizing useful knowledge from collections of data, filtering out irrelevant information, setting the remaining information in the context of the business and its environment, and discovering what plausible actions can be taken or decisions made at different times. To sum up, it is a process of using information wisely that keeps you up to date. And again, it is a data-driven decision support system.

Relationship between Business Intelligence and Data warehousing
So why are we talking about these two terms together? Are they inter-related? Are they the same? What is their relationship?
Many tool vendors call their products business intelligence software rather than data warehousing software. So what is it? Business intelligence is a term commonly associated with data warehousing. It is a generalized term for the various activities a company undertakes to gather information about today's market, including its competitors. Today's business intelligence systems are contrasted with more classical ways of gathering information in that they mine and crunch the data in the most optimal manner. In short, BI simplifies information discovery and analysis. In this way the company gains a competitive advantage by intelligently using the available data for strategic and effective decision making. It has the ability to bring disparate data under one roof as meaningful information ready for analysis.
So what does data warehousing have to do with business intelligence?
Business intelligence usually refers to the information that is available for the enterprise to make decisions on. A data warehousing (or data mart) system is the backend, or infrastructural, component for achieving business intelligence. Business intelligence also includes the insight gained from doing data mining analysis, as well as unstructured data (thus the need for content management systems).
Let me walk through the path of data warehousing. Source data from disparate sources is used to load or stage data; the sources can be flat files, another database, or some other process. The data warehouse first has to extract the data in order to load it into its environment; this is extraction. The data may not be in the expected format or size, and your business demands or your organization's requirements differ, so the process has to modify, or better, transform, the incoming data to meet the requirements and objectives; this is called transformation. Once all the slicing and dicing of the data is done and the business rules have been applied, the data is ready for loading into the target tables; this is called loading. So far we have done extraction, transformation, and loading, in short, ETL. There are lots of tools available in today's market that help with the ETL process. Once the data is loaded into the database, it is ready for the next stage; we call that database the data warehouse database. The next step could be building data marts or reporting directly from it, and there are lots of tools available for reporting and analysis, sometimes called business reporting or analysis tools. If you look at the whole process, there is business intelligence involved throughout: the gurus call this data warehousing, and the system involved from end to end is called a business intelligence system.
To sum up the whole BI system or DW process: it is ETL plus reporting that is called data warehousing. Having read this, you should have an idea of everything involved in data warehousing. In my next chapter I'll discuss the concepts and terminology used in data warehousing.
One-line difference between a data warehouse and business intelligence: data warehousing helps you store the data, while business intelligence helps you use the data for decision making, forecasting, etc.
Data warehousing, using ETL jobs, stores data in a meaningful form; business intelligence tools were born to query that data for reporting and forecasting.



Chapter 3: Concepts of Data Warehousing
Introduction: The concepts of data warehousing and business intelligence are important for professionals working on projects in this subject area. Working on a business intelligence (BI) or data warehousing (DW) project can be overwhelming if you don't have a solid grounding in the basics. It's difficult to focus on the goals of the project when you're bogged down by unanswered questions, or don't even know what questions to ask. By arming yourself with knowledge of the concepts and fundamentals, you can hit the ground running. As we now know, a data warehouse is a central repository for a subset of data that an enterprise's various business systems collect. Historically, data warehouses were most often used as a central repository to integrate, cleanse, and reconcile data used for business intelligence (BI) and analysis. This chapter provides an overview that gives business and information technology professionals the confidence to dive right into their business intelligence and data warehousing activities and contribute to their success.

Different concepts of the data warehouse:
Data Warehouse: A data warehouse is a subject oriented, integrated, time variant, non volatile collection of data in support of management's decision making process.
Data mart: A data mart is a subset of a data warehouse that is designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, data is derived from an enterprise-wide data warehouse; in an independent data mart, data is collected directly from sources. It is specific to a business function, and this database contains data summarized at multiple levels of granularity.
Dimension: A category of information; a grouping of data that defines something we are analyzing. For example, Gender or Office.
Measure: It is the performance Indicator of the data that you are analyzing. For example, Score on a test, Number of bail-outs, percentage of foreclosures, 2008 budget for different offices.
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.
Data Cleansing: The transformation of data from its current state into a predefined, standardized format using software; in other words, formatting the incoming data to your business needs.
Data Extraction: The process of pulling data from external data sources in order to feed it to the data warehouse environment.
Dataflow Diagrams (DFDs): A technique that describes the data flow in a business. It could be the data flow of one system or of a chain of systems that are inter-related and inter-dependent.
Decision Support System: A system that primarily focuses on supporting analytical and tactical reporting. It is a trend analyzer: the source for reporting the trends in the business over time, with detailed information.
Dimension Table: A table that contains discrete values. It contains the attributes used to constrain and group data when a query is performed. These attributes may change over time; for example, the phone number of a person may change. This is referred to as a slowly changing dimension.
Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. The measure is stored in the fact table at the appropriate granularity; for example, it can be sales amount by store by day. It is usually the primary table and contains measurements (mostly numeric data such as grades, wages, etc.). If Employee Name is a dimension attribute, then numeric measures such as the salary amounts associated with each employee are the facts stored in the fact table.
Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specifies how that particular quarter is represented on a report (for example, first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.
Online Analytical Processing (OLAP): The system that analyzes the data to report business trends. It gives statistical information about the business retrieved from the data warehouse. Also called multidimensional analysis.
Operational Data Store: A system with a database structure that is the repository for near real-time operational data rather than long-term trend data.
Reverse Engineering: It is a reverse process where you start engineering the existing code to write system specifications or to generate documentation.
Schema: A model containing information about the database implementation.
Star Schema: In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.
Snowflake Schema: The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. The main advantage of the snowflake schema is an improvement in query performance due to minimized disk storage requirements and joins against smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.
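To make the star schema concrete, here is a minimal DDL sketch of one fact table radially connected to two dimension (lookup) tables. All table and column names are illustrative, and the generic ANSI-style constraints may need adjusting for a specific platform such as Teradata (where the primary index choice matters as well).

CREATE TABLE dim_date (
    date_key     INTEGER NOT NULL PRIMARY KEY,
    calendar_dt  DATE,
    month_nbr    INTEGER,
    quarter_nbr  INTEGER
);

CREATE TABLE dim_store (
    store_key    INTEGER NOT NULL PRIMARY KEY,
    store_name   VARCHAR(50),
    region_name  VARCHAR(50)
);

-- The fact table holds the measures plus foreign keys to each dimension.
CREATE TABLE fact_sales (
    date_key     INTEGER NOT NULL REFERENCES dim_date (date_key),
    store_key    INTEGER NOT NULL REFERENCES dim_store (store_key),
    sales_amount DECIMAL(18,2)
);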


to be continued...TERADATA