Diffstat (limited to 'blog/2021-08-25-audit-sampling.org')
-rw-r--r-- blog/2021-08-25-audit-sampling.org | 293
1 files changed, 154 insertions, 139 deletions
diff --git a/blog/2021-08-25-audit-sampling.org b/blog/2021-08-25-audit-sampling.org
index 8283199..ac6f157 100644
--- a/blog/2021-08-25-audit-sampling.org
+++ b/blog/2021-08-25-audit-sampling.org
@@ -1,52 +1,55 @@
-+++
-date = 2021-08-25
-title = "Audit Sampling with Python"
-description = "Learn how to use Python to automate the boring parts of audit sampling."
-draft = false
-+++
-
-## Introduction
+#+title: Audit Sampling with Python
+#+date: 2021-08-25
+** Introduction
+:PROPERTIES:
+:CUSTOM_ID: introduction
+:END:
If you are familiar with internal auditing, external auditing, or
-consulting, you will understand how tedious audit testing can become when you
-are required to test large swaths of data. When we cannot establish an automated
-means of testing an entire population, we generate samples to represent the
-population of data. This helps ensure we can have a small enough data pool to
-test and that our results still represent the population.
-
-However, sampling data within the world of audit still seems to confuse quite a
-lot of people. While some audit-focused tools have introduced sampling
-functionality (e.g. Wdesk), many audit departments and firms cannot use software
-like this due to certain constraints, such as the team's budget or knowledge.
-Here is where this article comes in: we're going to use
-[Python](https://www.python.org), a free and open-source programming language,
-to generate random samples from a dataset in order to suffice numerous audit
-situations.
-
-## Audit Requirements for Sampling
-
-Before we get into the details of how to sample with Python, I want to make sure
-I discuss the different requirements that auditors may have of samples used
-within their projects.
-
-### Randomness
-
-First, let's discuss randomness. When testing out new technology to help assist
-with audit sampling, you need to understand exactly how your samples are being
-generated. For example, if the underlying function is just picking every 57th
-element from a list, that's not truly random; it's a systematic form of
-sampling. Luckily, since Python is open-source, we have access to its codebase.
-Through this blog post, I will be using the [pandas](https://pandas.pydata.org)
-module in order to generate the random samples. More specifically, I will be
-using the
-[pandas.DataFrame.sample](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html)
+consulting, you will understand how tedious audit testing can become
+when you are required to test large swaths of data. When we cannot
+establish an automated means of testing an entire population, we
+generate samples to represent the population of data. This helps ensure
+we can have a small enough data pool to test and that our results still
+represent the population.
+
+However, sampling data within the world of audit still seems to confuse
+quite a lot of people. While some audit-focused tools have introduced
+sampling functionality (e.g. Wdesk), many audit departments and firms
+cannot use software like this due to certain constraints, such as the
+team's budget or knowledge. Here is where this article comes in: we're
+going to use [[https://www.python.org][Python]], a free and open-source
+programming language, to generate random samples from a dataset in a
+way that covers a range of common audit situations.
+
+** Audit Requirements for Sampling
+:PROPERTIES:
+:CUSTOM_ID: audit-requirements-for-sampling
+:END:
+Before we get into the details of how to sample with Python, I want to
+make sure I discuss the different requirements that auditors may have of
+samples used within their projects.
+
+*** Randomness
+:PROPERTIES:
+:CUSTOM_ID: randomness
+:END:
+First, let's discuss randomness. When testing out new technology to help
+assist with audit sampling, you need to understand exactly how your
+samples are being generated. For example, if the underlying function is
+just picking every 57th element from a list, that's not truly random;
+it's a systematic form of sampling. Luckily, since Python and pandas
+are open-source, we can read the code ourselves. Throughout this blog
+post, I will be using the [[https://pandas.pydata.org][pandas]] module in order
+to generate the random samples. More specifically, I will be using the
+[[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html][pandas.DataFrame.sample]]
function provided by Pandas.
-Now that you know what you're using, you can always check out the code behind
-`pandas.DataFrame.sample`. This function does a lot of work, but we really only
-care about the following snippets of code:
+Now that you know what you're using, you can always check out the code
+behind =pandas.DataFrame.sample=. This function does a lot of work, but
+we really only care about the following snippets of code:
-```python
+#+begin_src python
# Process random_state argument
rs = com.random_state(random_state)
@@ -58,45 +61,45 @@ if ignore_index:
result.index = ibase.default_index(len(result))
return result
-```
-
-The block of code above shows you that if you assign a `random_state`
-argument when you run the function, that will be used as a seed number in
-the random generation and will allow you to reproduce a sample, given that
-nothing else changes.
-This is critical to the posterity of audit work.
-After all, how can you say your audit process is adequately documented if
-the next person can't run the code and get the same sample?
-The final piece here on randomness is to look at the [choice](https://docs.
-python.org/3/library/random.html#random.choice) function used above.
-This is the crux of the generation and can also be examined for more
-detailed analysis on its reliability.
-As far as auditing goes, we will trust that these functions are
-mathematically random.
-
-### Sample Sizes
-
-As mentioned in the intro, sampling is only an effective method of auditing
-when it truly represents the entire population.
-While some audit departments or firms may consider certain judgmental sample
-sizes to be adequate, you may need to rely on statistically-significant
-confidence levels of sample testing at certain points.
-I will demonstrate both here.
-For statistically-significant confidence levels, most people will assume a
-90% - 99% confidence level.
-In order to actually calculate the correct sample size, it is best to use
-statistical tools due to the tedious math work required.
-For example, for a population of 1000, and a 90% confidence level that no
-more than 5% of the items are nonconforming, you would sample 45 items.
-
-However, in my personal experience, many audit departments and firms do not use
-statistical sampling. Most people use a predetermined, often proprietary, table
-that will instruct auditors which sample sizes to choose. This allows for
-uniform testing and reduces overall workload. See the table below for a common
-implementation of sample sizes:
+#+end_src
+
+The block of code above shows that if you pass a =random_state=
+argument when you run the function, it is used as the seed for the
+random number generator, which lets you reproduce a sample as long as
+the data and the other arguments stay the same. This is critical for
+anyone who later reviews or reperforms the audit work. After all, how
+can you say your audit process is adequately documented if the next
+person can't run the code and get the same sample? The final piece here
+on randomness is to look at the
+[[https://docs.python.org/3/library/random.html#random.choice][choice]]
+function used above. In pandas, =rs= is a NumPy random state object, so
+the call goes to NumPy's =choice= rather than the standard library's,
+though the idea is the same. This is the crux of the generation and can
+also be examined for more detailed analysis on its reliability. As far
+as auditing goes, we will trust that these pseudorandom functions are
+random enough for our purposes.
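
As a minimal sketch of the reproducibility point above, the snippet
below builds a small, made-up DataFrame and draws the same sample twice
with the same =random_state=. The column names, values, and seed numbers
are assumptions chosen for illustration only.

```python
import pandas

# Hypothetical population of 100 invoices; not taken from any real dataset
df = pandas.DataFrame({
    "invoice_id": range(1, 101),
    "amount": [round(i * 12.5, 2) for i in range(1, 101)],
})

# Two draws with the same seed select the same rows in the same order
first_draw = df.sample(n=10, random_state=42)
second_draw = df.sample(n=10, random_state=42)

# A draw with a different seed will generally select different rows
other_draw = df.sample(n=10, random_state=7)

# True when both draws contain identical rows
print(first_draw.equals(second_draw))
```

Recording the seed alongside the input file is what lets a reviewer
rerun the script and arrive at the same selections.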
+
+*** Sample Sizes
+:PROPERTIES:
+:CUSTOM_ID: sample-sizes
+:END:
+As mentioned in the intro, sampling is only an effective method of
+auditing when it truly represents the entire population. While some
+audit departments or firms may consider certain judgmental sample sizes
+to be adequate, you may need to rely on statistical sample sizes tied
+to a stated confidence level at certain points. I will demonstrate both
+here. For statistical sampling, most people will assume a confidence
+level between 90% and 99%. In order to actually calculate the correct
+sample size, it is best to use statistical tools due to the tedious
+math work required. For example, for a population of 1000, and a 90%
+confidence level that no more than 5% of the items are nonconforming,
+you would sample 45 items, assuming you expect to find no exceptions in
+the sample.
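
The 45-item figure can be reproduced with a short calculation. The
sketch below is one common way to arrive at it, not necessarily the
method behind the number above: it assumes zero expected exceptions and
uses a binomial approximation that ignores the population size. The
function name is hypothetical.

```python
import math


def zero_exception_sample_size(confidence: float, tolerable_rate: float) -> int:
    """Return the smallest n where (1 - tolerable_rate) ** n <= 1 - confidence.

    If n items are tested and none fail, this supports the stated confidence
    that the population exception rate does not exceed tolerable_rate.
    """
    if not 0 < confidence < 1 or not 0 < tolerable_rate < 1:
        raise ValueError("confidence and tolerable_rate must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - tolerable_rate))


# 90% confidence that no more than 5% of items are nonconforming
print(zero_exception_sample_size(0.90, 0.05))  # 45
```

Raising the confidence level or lowering the tolerable rate increases
the required sample size, which is why a dedicated statistical tool is
still the safer choice for anything beyond this simple case.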
+
+However, in my personal experience, many audit departments and firms do
+not use statistical sampling. Most people use a predetermined, often
+proprietary, table that will instruct auditors which sample sizes to
+choose. This allows for uniform testing and reduces overall workload.
+See the table below for a common implementation of sample sizes:
| Control Frequency | Sample Size - High Risk | Sample Size - Low Risk |
-|-------------------|-------------------------|------------------------|
+|-------------------+-------------------------+------------------------|
| More Than Daily | 40 | 25 |
| Daily | 40 | 25 |
| Weekly | 12 | 5 |
@@ -106,23 +109,27 @@ implementation of sample sizes:
| Annually | 1 | 1 |
| Ad-hoc | 1 | 1 |
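
A table like this can be kept next to the sampling script so that the
sample size is looked up rather than typed by hand. The sketch below
encodes only the rows visible above; the dictionary and function names
are hypothetical, and any other frequencies your own table defines would
need to be added.

```python
# Sample sizes keyed by control frequency, then by risk level.
# Only the rows shown in the table above are included here.
SAMPLE_SIZES = {
    "More Than Daily": {"high": 40, "low": 25},
    "Daily": {"high": 40, "low": 25},
    "Weekly": {"high": 12, "low": 5},
    "Annually": {"high": 1, "low": 1},
    "Ad-hoc": {"high": 1, "low": 1},
}


def lookup_sample_size(frequency: str, risk: str) -> int:
    """Return the sample size for a control frequency and risk level."""
    try:
        return SAMPLE_SIZES[frequency][risk.lower()]
    except KeyError:
        raise ValueError(f"No sample size defined for {frequency!r} / {risk!r}")


print(lookup_sample_size("Weekly", "High"))  # 12
```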
-## Sampling with Python & Pandas
-
-In this section, I am going to cover a few basic audit situations that require
-sampling. While some situations may require more effort, the syntax,
-organization, and intellect used remain largely the same. If you've never used
-Python before, note that lines starting with a '`#`' symbol are called comments,
-and they will be skipped by Python. I highly recommend taking a quick tutorial
-online to understand the basics of Python if any of the code below is confusing
-to you.
-
-### Simple Random Sample
-
-First, let's look at a simple, random sample. The code block below will import
-the `pandas` module, load a data file, sample the data, and export the sample to
-a file.
-
-```python
+** Sampling with Python & Pandas
+:PROPERTIES:
+:CUSTOM_ID: sampling-with-python-pandas
+:END:
+In this section, I am going to cover a few basic audit situations that
+require sampling. While some situations may require more effort, the
+syntax, organization, and logic remain largely the same. If you've
+never used Python before, note that lines starting with a '=#=' symbol
+are comments, which Python skips when it runs the code. I highly
+recommend taking a quick tutorial online to understand the basics of
+Python if any of the code below is confusing to you.
+
+*** Simple Random Sample
+:PROPERTIES:
+:CUSTOM_ID: simple-random-sample
+:END:
+First, let's look at a simple, random sample. The code block below will
+import the =pandas= module, load a data file, sample the data, and
+export the sample to a file.
+
+#+begin_src python
# Import the Pandas module
import pandas
@@ -140,14 +147,16 @@ sample = df.sample(n=25, random_state=0)
# Save the sample to Excel
sample.to_excel(file_output)
-```
+#+end_src
-### Simple Random Sample: Using Multiple Input Files
+*** Simple Random Sample: Using Multiple Input Files
+:PROPERTIES:
+:CUSTOM_ID: simple-random-sample-using-multiple-input-files
+:END:
+Now that we've created a simple sample, let's create a sample from
+multiple files.
-Now that we've created a simple sample, let's create a sample from multiple
-files.
-
-```python
+#+begin_src python
# Import the Pandas module
import pandas
@@ -174,15 +183,17 @@ sample = pandas.concat([sample_01, sample_02, sample_03], ignore_index=True)
# Save the sample to Excel
sample.to_excel(file_output)
-```
-
-### Stratified Random Sample
+#+end_src
-Well, what if you need to sample distinct parts of a single file? For example,
-let's write some code to separate our data by "Region" and sample those regions
-independently.
+*** Stratified Random Sample
+:PROPERTIES:
+:CUSTOM_ID: stratified-random-sample
+:END:
+Well, what if you need to sample distinct parts of a single file? For
+example, let's write some code to separate our data by "Region" and
+sample those regions independently.
-```python
+#+begin_src python
# Import the Pandas module
import pandas
@@ -208,16 +219,18 @@ sample = pandas.concat([sample_east, sample_west], ignore_index=True)
# Save the sample to Excel
sample.to_excel(file_output)
-```
-
-### Stratified Systematic Sample
-
-This next example is quite useful if you need audit coverage over a certain time
-period. This code will generate samples for each month in the data and combine
-them all together at the end. Obviously, this code can be modified to stratify
-by something other than months, if needed.
-
-```python
+#+end_src
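
As an alternative sketch, recent versions of pandas (1.1 and later)
can draw the same number of items from every stratum in one step with
=groupby(...).sample(...)=. The data below is made up; only the "Region"
column name comes from the example above, and "Sale ID" is a
hypothetical column.

```python
import pandas

# Hypothetical sales data with two regions of six records each
df = pandas.DataFrame({
    "Region": ["East"] * 6 + ["West"] * 6,
    "Sale ID": range(1, 13),
})

# Draw 3 records from each region; random_state keeps the draw reproducible
sample = (
    df.groupby("Region")
    .sample(n=3, random_state=0)
    .reset_index(drop=True)
)

# Each region should contribute exactly 3 rows
print(sample["Region"].value_counts())
```

Note that this raises an error if any region has fewer records than the
requested sample size, so small strata may need to be handled
separately.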
+
+*** Stratified Systematic Sample
+:PROPERTIES:
+:CUSTOM_ID: stratified-systematic-sample
+:END:
+This next example is quite useful if you need audit coverage over a
+certain time period. This code will generate samples for each month in
+the data and combine them all together at the end. Obviously, this code
+can be modified to stratify by something other than months, if needed.
+
+#+begin_src python
# Import the Pandas module
import pandas
@@ -257,21 +270,23 @@ def monthly_stratified_sample(df: pandas.DataFrame, date_column: str, num_select
sample_size = 3
sample = monthly_stratified_sample(df, 'Date of Sale', sample_size)
sample.to_excel(file_output)
-```
-
-## Documenting the Results
-
-Once you've generated a proper sample, there are a few things left to do in
-order to properly ensure your process is reproducible.
-
-1. Document the sample. Make sure the resulting file is readable and includes
- the documentation listed in the next bullet.
-2. Include documentation around the data source, extraction techniques, any
- modifications made to the data, and be sure to include a copy of the script
- itself.
-3. Whenever possible, perform a completeness and accuracy test to ensure your
- sample is coming from a complete and accurate population. To ensure
- completeness, compare the record count from the data source to the record
- count loaded into Python. To ensure accuracy, test a small sample against the
- source data (e.g., test 5 sales against the database to see if the details are
- accurate).
+#+end_src
+
+** Documenting the Results
+:PROPERTIES:
+:CUSTOM_ID: documenting-the-results
+:END:
+Once you've generated a proper sample, there are a few things left to do
+in order to properly ensure your process is reproducible.
+
+1. Document the sample. Make sure the resulting file is readable and
+ includes the documentation listed in the next bullet.
+2. Include documentation around the data source, extraction techniques,
+ any modifications made to the data, and be sure to include a copy of
+ the script itself.
+3. Whenever possible, perform a completeness and accuracy test to ensure
+ your sample is coming from a complete and accurate population. To
+ ensure completeness, compare the record count from the data source to
+ the record count loaded into Python. To ensure accuracy, test a small
+ sample against the source data (e.g., test 5 sales against the
+ database to see if the details are accurate).
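
The completeness and accuracy checks in the last item can be sketched
roughly as follows. Every value here is invented for illustration: the
record count and the lookup values stand in for figures you would obtain
from the source system yourself, and the column names are hypothetical.

```python
import pandas

# Hypothetical record count reported by the source system
source_record_count = 5

# Hypothetical data as loaded into Python
df = pandas.DataFrame({
    "Sale ID": [101, 102, 103, 104, 105],
    "Amount": [250.00, 19.99, 75.50, 1200.00, 42.00],
})

# Completeness: the loaded record count should match the source count
loaded_record_count = len(df)
is_complete = loaded_record_count == source_record_count

# Accuracy: compare a few loaded records against values looked up in the source
source_lookup = {101: 250.00, 103: 75.50}
mismatches = [
    sale_id
    for sale_id, amount in source_lookup.items()
    if df.loc[df["Sale ID"] == sale_id, "Amount"].iloc[0] != amount
]

print(f"Complete: {is_complete}, mismatched records: {mismatches}")
```

Saving this output with the workpapers gives the reviewer evidence that
the checks were actually performed.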