AuditNet®
Open Source Software for data mining with a Mac
by Mike Blakely
Open Source Software for data mining with a Mac
A number of open source software packages can be run on a Macintosh computer and are well suited for data mining and other audit support functions. This article will provide a brief overview of some of the major packages, describing the functionality of the package which can support commonly performed audit procedures.
First, some definitions or scope limitations. To be included in this article, the software package had to be open source, meaning that the source code for the package is freely available for download by anyone interested. Second, there had to be no license cost for obtaining and using the software. Third, the package had to be able to run on a Macintosh computer. And fourth, the package had to provide at least basic audit procedure capabilities. The term “data mining” is often loosely used, but for the purposes of this article, the meaning is that defined in Wikipedia:
“Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), a relatively young and interdisciplinary field of computer science, is the process that attempts to discover patterns in large data sets. It utilizes methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use..”
Application of data mining concepts to auditing
Traditionally, data mining concepts have been most often applied outside of auditing, e.g. in marketing, operational planning, scientific research, etc. These applications have most often leveraged data contained in significant databases running on large enterprise computers.
However, many of the same data mining concepts can also be applied to audits of the applications which use the databases. And although it is possible for auditors to audit applications using the underlying databases, that is not often done for a number of reasons such as 1) operational risks for the databases, 2) performance considerations, 3) hesitance by IT operations to allow uncontrolled access to databases, etc. Because of these and other reasons, when auditors need access to production databases, they are generally given a snap shot or data extract from the database to work with. The auditor can then take this snap shot or extract and load the information into a database system controlled by the auditors for further analysis. Such a process has a number of advantages – 1) little or no risk of interference with production data, 2) auditor generally has completely unrestricted access to their own copy of the data, and 3) the auditor can work in a fairly unrestricted fashion.
Types of data mining tests applicable to auditing
Discovery of patterns in datasets will most often be focused on areas such as relationships between data elements, trends over time and unusual or exceptional items which may merit further investigation. Many transaction amounts have an inherent linear relations ship – for example, as sales increase, so too do cost of sales, payroll, and advertising costs. These relationships can be determined using a number of statistical approaches such as linear regression analysis.
Another area which is often of interest is analysis of trends over a time series. Often these will have somewhat random fluctuations, but time series can also be analyzed as well for statistically unusual patterns. Many other types of data mining are based upon obtaining descriptive statistics, identification of duplicates, as well as various categorization procedures such as ageing or data stratification.
Primary Open Source Audit Software Packages
There are at least three open source software packages which may be suitable for data mining using a Mac. The three are Picalo, a Python based system, “R”, a statistical system, and WebCAAT, a web-based system. Each of these systems are described more fully below, along with the author’s opinion of their advantages and disadvantages. There may be other open source systems having audit capabilities which run on the Mac. However, these are the primary systems known to the author.
Picalo
Picalo was developed by Professor Conan Albrecht at Brigham Young University. It is based on the Python language and provides a number of audit capabilities through the use of “scriptlets” which are pre-packaged code segments to perform various audit functions. The software is licensed under the GPL license.
Picalo has the ability to connect with various databases including MySQL, SQLite and others, enabling it to be able to work with very large databases. Data can be imported into Picalo from a variety of sources such as Excel (.xls) files, text files (tab separated and comma separated value format), as well as files having data in a fixed position.
Picalo also has the ability to run user developed scripts, which are developed using the Python computer language. Python is a stable legacy object oriented language which is similar to other object oriented languages such as C++, PHP, etc.
Picalo’s built-in capabilities include detection of duplicates, descriptive statistics, Benford’s Law, and a variety of functions applicable to procurement, payroll, and other application systems through the use of “scriptlets” which are included.
More information about the Picalo project is available at the Picalo site www.picalo.org. The Mac version of the software can be downloaded at http://www.picalo.org/download/current-picalo.dmg The Picalo software also runs on Windows and many of the Unix platforms (Linux, SUSE, Fedora, Ubuntu).
The Picalo software has both advantages and disadvantages which are summed up below:
Advantages
- Extensible – the system is designed handle add-ons (“scriptlets”) as well as a variety of data sources
- Uses Python which is a powerful scripting and support language
- Includes “wizards” to make operation of the system more intuitive
- Platform independent – Windows, Mac, Unix, etc.
Disadvantages
- Lack of substantial support
- There does not appear to be a wide number of “scriptlets” available (other than those provided with the system)
- Works primarily on memory resident tables which could limit functionality
R
R is a comprehensive statistical and graphics package which has been a collaborative development, primarily by college professors of statistics world-wide. It runs on a wide variety of UNIX platforms, as well as Windows and Mac OS X.
Licensing is GPL. It includes an extensive library of statistical functions, including some which are found in very few other places. An example is the function for extrapolation of results obtained from attribute sampling. R can be integrated with a number of databases such as MySQL and SQLite, and is able to handle very large volumes of information.
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes
- an effective data handling and storage facility,
- a suite of operators for calculations on arrays, in particular matrices,
- a large, coherent, integrated collection of intermediate tools for data analysis,
- graphical facilities for data analysis and display either on-screen or on hardcopy, and
- a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
R’s built-in capabilities include almost every type of statistical analysis you’ve ever heard of (any many you probably haven’t) , as well as a variety of analytical functions such as linear regression, standard descriptive statistics and other analytical procedures. Output from the system includes both plain text as well as graphics (PNG, JPG, PDF, etc.)
More information about R is available at http://www.r-project.org/
The R software has both advantages and disadvantages which are summed up below:
Advantages
- Very extensive library of statistical (and other) functions
- Legacy system which has been tested fairly extensively and is widely used
- Platform independent – Windows, Mac, Unix, etc.
- Nice graphics capability
Disadvantages
- Fairly steep learning curve (for both the language and the functions)
- Does not include extensive functions specifically tailored to auditing
- Limited graphical user interface (GUI) – primarily command line oriented
Web CAAT
Web CAAT is a web based data analysis package which runs as a web service on a desk top or web server. It uses the PHP scripting language and can be run as either a menu driven system or else through the use of scripts. The system includes a MySQL database which is capable of handling very large collections of data. In order to analyze data, it must first be imported from text files in either tab separated or comma separated value formats.
The system includes a menu driven suite of some fifty commonly used audit applications. In addition, a scripting library is included which provides similar audit functionality. The system also provides specialized functionality for the detection of potential duplicate payments, identification of price variations among purchased goods, as well as basic audit tests for accounts payable, accounts receivable, payroll, journal entries, etc.
The system does not provide any graphics capabilities, although the results can be exported and then re-imported into other packages such as Excel for basic charts.
Licensing is LGPL3. More information about the system is available at http://webcaat.org/wc. The system can also be run on Windows and most variations of Unix. An operational version of the system running on Linux can be viewed at http://webcaat.org/test.
Installation instructions are available at http://webcaat.org/wc/
WebCAAT’s built-in capabilities include about fifty commonly performed audit functions such as descriptive statistics, trend lines, duplicates, gaps, Benford’s Law, round numbers, aging, holidays, etc. In addition a number of specialized functions are included such as vendor/employee cross-checking, Post Office boxes, same, same, different, etc.
The WebCAAT system has both advantages and disadvantages which are summed up below:
Advantages
- Library of pre-built audit functions
- Web based menu driven system shortens the learning curve
- Platform independent – Windows, Mac, Unix, etc.
- Scripting capabilities using the built in libraries and PHP
Disadvantages
- Limited to use with MySQL only
- Does not include any graphics functions
- Lack of support (although there are a number of “how to” articles)
Conclusion
It is feasible to perform audit analytics using a Mac. The author would be interested in hearing of any other open source packages available for the Mac and can be contacted at Mike.Blakley@ezrstats.com.
The opinions, beliefs and viewpoints expressed by
the various authors and forum participants on this web site do not
necessarily reflect the opinions, beliefs and viewpoints of AuditNet®


