AuditNet®
CAATT Tales

Data Mining with IDEA
By Don Sparks, CIA, CISA, ARM
Audimation Services, Inc.
Experts in Data Analysis
Volume 1 Issue 4
What built-in data mining features are included in IDEA? One of the best features often overlooked by users is the Field Statistics function.
Data mining is a term used frequently by auditors that can be a tough term to define as many auditors do not make a distinction between data analysis and data mining, i.e., they see the terms as one and the same. I use the term data analysis when I am referring to managing data to effectively and efficiently conduct routine audit program objectives. Examples are importing troublesome file formats like print and pdf files, joining or appending files, sampling, stratification, gap detection, duplicate detection, pivot tables, Benford’s Law, search features, and many more.
While conducting the analysis, sometimes the auditor’s curiosity may be alerted by an anomaly in the analysis or a red flag. Since the data is already readily available, the auditor may decide to dig beneath the analysis and search for the root cause. This is what I refer to as data mining. Again, using the included features of the software just mentioned the auditor is able to quickly satisfy their curiosity by looking across and down the entire scope of the audit (100% of the data).
Field Statistics
Field statistics is a built in feature of IDEA data analysis software and is activated every time that you import or open a database. The Field Statistics property provides statistical information about all Numeric, Date, and Time fields within the active database. The field statistics are available and displayed for all records in the database, with any applied criteria ignored.
The field statistics give a valuable initial analysis of the database, helping you to gain an understanding of the data and perhaps even identify problems to be further investigated, such as negative inventory items, zero-value check payments, or after-date transactions.
The Field Statistics task is a process that involves generating the statistics and then viewing the statistics. A new feature of IDEA is that some of the field statistics values can be called into your formulas and equations to leverage data that you already have, i.e. @fieldstatistics.
Note: Field statistics are normally generated when importing the database by selecting the
Generate Field Statistics check box on the Import Assistant dialog box. Or for child databases that are created by IDEA tasks, field statistics can be automatically generated if Auto generate field stats is selected in the View > Options dialog box.
However, if they have not been generated, when you select the Field Statistics link in the Properties window, a dialog box displays asking:
Do you wish to create statistics for all fields without statistics?
Click Yes.
Once the statistics have been generated, the results are held in the IDEA Merged or Compound Document (file name.IMD) and can be viewed by clicking the Field Statistics link in the Properties window.
From the Field Statistics window, you can:
• Extract the records for a particular statistic (click on the hyperlink).
You can extract the records associated with a particular statistic if the statistic is displayed as a link (blue and underlined). To extract the records, click the link. This displays a new window with the relevant records. From here you can then save the records as a new database or print it.
• Select the field (numeric, data, time) for which statistics are to be displayed (see the example screen shots that follow each field statistic list).
a. Numeric field
- Net Value: Total value of the field, with sign (positive or negative) considered.
- Absolute Value: Total of the absolute of the values in the field.
- # of Records: Total number of records in the database.
- # of Zero Items: Number of records with a zero value in the field.
- Positive Value: Total of debit (positive) values in the field.
- Negative Value: Total of credit (negative) values in the field.
- # of Positive Records: Number of records with debit (positive) values in the field.
- # of Negative Records: Number of records with credit (negative) values in the field.
- # of Data Errors: Number of records with errors in the field.
- # of Valid Values: Number of records with valid values in the field.
- Average Value: Average of values, such as net value/number of records, in the field.
- Minimum Value: Lowest value in the field.
- Maximum Value: Highest value in the field.
- Record # of Min: Record number of the record that contains the minimum value in the field. If more than one record contains the minimum value, the record number of the first record that contains the minimum value in the field will be displayed.
- Record # of Max: Record number of the record that contains the maximum value in the field. If more than one record contains the maximum value, the record number of the first record that contains the maximum value in the field will be displayed.
- Sample Std Dev: A numeric value that represents the distribution of data in a selected sample.
- Sample Variance: A numeric value that indicates the dispersion of data around its mean value in a selected sample.
- Pop Std Dev: A numeric value that represents the distribution of the data.
- Pop Variance: A numeric value that indicates the dispersion of data around its mean value.
- Pop Skewness: A numeric value that indicates the asymmetry of the data.
- Pop Kurtosis: A numeric value that indicates the shape of the distribution of data values.

OR...
b. Date field

- # of Valid Values: Number of records with valid values in the field.
- # of Zero Items: Number of records with a zero value in the field.
- # of Records: Total number of records in the database.
- # of Data Errors: Number of records with errors in the field.
- Earliest Date: Earliest date in the field. Useful for testing cut-off dates. Note: When used with @FieldStatistics , the earliest date is returned in Julian format.
- Latest Date: Latest date in the field. Useful for testing a cut-off dates. Note: When used with @FieldStatistics , the latest date is returned in Julian format.
- Record # of Earliest: Record number of the record that contains the earliest date in the field. If more than one record contains the earliest date, the record number of the first record that contains the earliest date in the field will be displayed.
- Record # of Latest: Record number of the record that contains the latest date in the field. If more than one record contains the latest date, the record number of the first record that contains the latest date in the field will be displayed.
- Most Common Day: The day of the week that most frequently appears in the date values in the field.
- Most Common Month: The month that most frequently appears in the date values in the field.
- Items in Month: Number of records with values that contain the particular month.
- Items on Day: Number of records with values that contain the particular day of the week.

OR...
C. Time field

- # of Records: Total number of records in the database.
- # of Valid Values: Number of records with valid values in the field.
- # of Data Errors: Number of records with errors in the field.
- # of Zero Items: Number of records with a zero value in the field.
- Latest Time: Latest time value in the field. Note: When used with @FieldStatistics , the latest time is returned in number of seconds.
- Earliest Time: Earliest time value in the field. Note: When used with @FieldStatistics , the earliest time is returned in number of seconds.
- Most Common Hour: The hour that most frequently appears in the time values in the field.
- Most Common Minute: The minute that most frequently appears in the time values in the field.
- Most Common Second: The second that most frequently appears in the time values in the field.
- # of Records in AM: The number of records in the field that have a time value that falls before 12 noon.
- # of Records in PM: The number of records in the field that have a time value that falls after 12 noon.
- # of Records Before 6 AM: The number of records with a time value that falls before 06:00 in the morning.
- # of Records After 6 PM: The number of records with a time value that falls after 06:00 in the evening.
- Average Time: The average time value for the field. Note: When used with @FieldStatistics , the average time is returned in number of seconds.
- # of Records Less Than a Day: The number of records in the field that are less than 24 hours in duration.
- # of Records More Than a Day: The number of records in the field that are equal to or more than 24 hours in duration.

To
respond to this solution or provide data analysis questions you would
like answered in future newsletter articles, please send an email to
dons@audimation.com
The opinions, beliefs and viewpoints expressed by the various authors and forum participants on this web site do not necessarily reflect the opinions, beliefs and viewpoints of AuditNet®

