Diffstat (limited to 'blog/2021-08-25-audit-sampling.org')
-rw-r--r-- | blog/2021-08-25-audit-sampling.org | 293 |
1 files changed, 154 insertions, 139 deletions
diff --git a/blog/2021-08-25-audit-sampling.org b/blog/2021-08-25-audit-sampling.org index 8283199..ac6f157 100644 --- a/blog/2021-08-25-audit-sampling.org +++ b/blog/2021-08-25-audit-sampling.org @@ -1,52 +1,55 @@ -+++ -date = 2021-08-25 -title = "Audit Sampling with Python" -description = "Learn how to use Python to automate the boring parts of audit sampling." -draft = false -+++ - -## Introduction +#+title: Audit Sampling with Python +#+date: 2021-08-25 +** Introduction +:PROPERTIES: +:CUSTOM_ID: introduction +:END: For anyone who is familiar with internal auditing, external auditing, or -consulting, you will understand how tedious audit testing can become when you -are required to test large swaths of data. When we cannot establish an automated -means of testing an entire population, we generate samples to represent the -population of data. This helps ensure we can have a small enough data pool to -test and that our results still represent the population. - -However, sampling data within the world of audit still seems to confuse quite a -lot of people. While some audit-focused tools have introduced sampling -functionality (e.g. Wdesk), many audit departments and firms cannot use software -like this due to certain constraints, such as the team's budget or knowledge. -Here is where this article comes in: we're going to use -[Python](https://www.python.org), a free and open-source programming language, -to generate random samples from a dataset in order to suffice numerous audit -situations. - -## Audit Requirements for Sampling - -Before we get into the details of how to sample with Python, I want to make sure -I discuss the different requirements that auditors may have of samples used -within their projects. - -### Randomness - -First, let's discuss randomness. When testing out new technology to help assist -with audit sampling, you need to understand exactly how your samples are being -generated. 
For example, if the underlying function is just picking every 57th -element from a list, that's not truly random; it's a systematic form of -sampling. Luckily, since Python is open-source, we have access to its codebase. -Through this blog post, I will be using the [pandas](https://pandas.pydata.org) -module in order to generate the random samples. More specifically, I will be -using the -[pandas.DataFrame.sample](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) +consulting, you will understand how tedious audit testing can become +when you are required to test large swaths of data. When we cannot +establish an automated means of testing an entire population, we +generate samples to represent the population of data. This helps ensure +we can have a small enough data pool to test and that our results still +represent the population. + +However, sampling data within the world of audit still seems to confuse +quite a lot of people. While some audit-focused tools have introduced +sampling functionality (e.g. Wdesk), many audit departments and firms +cannot use software like this due to certain constraints, such as the +team's budget or knowledge. Here is where this article comes in: we're +going to use [[https://www.python.org][Python]], a free and open-source +programming language, to generate random samples from a dataset in order +to suffice numerous audit situations. + +** Audit Requirements for Sampling +:PROPERTIES: +:CUSTOM_ID: audit-requirements-for-sampling +:END: +Before we get into the details of how to sample with Python, I want to +make sure I discuss the different requirements that auditors may have of +samples used within their projects. + +*** Randomness +:PROPERTIES: +:CUSTOM_ID: randomness +:END: +First, let's discuss randomness. When testing out new technology to help +assist with audit sampling, you need to understand exactly how your +samples are being generated. 
For example, if the underlying function is +just picking every 57th element from a list, that's not truly random; +it's a systematic form of sampling. Luckily, since Python is +open-source, we have access to its codebase. Through this blog post, I +will be using the [[https://pandas.pydata.org][pandas]] module in order +to generate the random samples. More specifically, I will be using the +[[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html][pandas.DataFrame.sample]] function provided by Pandas. -Now that you know what you're using, you can always check out the code behind -`pandas.DataFrame.sample`. This function does a lot of work, but we really only -care about the following snippets of code: +Now that you know what you're using, you can always check out the code +behind =pandas.DataFrame.sample=. This function does a lot of work, but +we really only care about the following snippets of code: -```python +#+begin_src python # Process random_state argument rs = com.random_state(random_state) @@ -58,45 +61,45 @@ if ignore_index: result.index = ibase.default_index(len(result)) return result -``` - -The block of code above shows you that if you assign a `random_state` -argument when you run the function, that will be used as a seed number in -the random generation and will allow you to reproduce a sample, given that -nothing else changes. -This is critical to the posterity of audit work. -After all, how can you say your audit process is adequately documented if -the next person can't run the code and get the same sample? -The final piece here on randomness is to look at the [choice](https://docs. -python.org/3/library/random.html#random.choice) function used above. -This is the crux of the generation and can also be examined for more -detailed analysis on its reliability. -As far as auditing goes, we will trust that these functions are -mathematically random. 
- -### Sample Sizes - -As mentioned in the intro, sampling is only an effective method of auditing -when it truly represents the entire population. -While some audit departments or firms may consider certain judgmental sample -sizes to be adequate, you may need to rely on statistically-significant -confidence levels of sample testing at certain points. -I will demonstrate both here. -For statistically-significant confidence levels, most people will assume a -90% - 99% confidence level. -In order to actually calculate the correct sample size, it is best to use -statistical tools due to the tedious math work required. -For example, for a population of 1000, and a 90% confidence level that no -more than 5% of the items are nonconforming, you would sample 45 items. - -However, in my personal experience, many audit departments and firms do not use -statistical sampling. Most people use a predetermined, often proprietary, table -that will instruct auditors which sample sizes to choose. This allows for -uniform testing and reduces overall workload. See the table below for a common -implementation of sample sizes: +#+end_src + +The block of code above shows you that if you assign a =random_state= +argument when you run the function, that will be used as a seed number +in the random generation and will allow you to reproduce a sample, given +that nothing else changes. This is critical to the posterity of audit +work. After all, how can you say your audit process is adequately +documented if the next person can't run the code and get the same +sample? The final piece here on randomness is to look at the +[[https://docs.%20python.org/3/library/random.html#random.choice][choice]] +function used above. This is the crux of the generation and can also be +examined for more detailed analysis on its reliability. As far as +auditing goes, we will trust that these functions are mathematically +random. 
+ +*** Sample Sizes +:PROPERTIES: +:CUSTOM_ID: sample-sizes +:END: +As mentioned in the intro, sampling is only an effective method of +auditing when it truly represents the entire population. While some +audit departments or firms may consider certain judgmental sample sizes +to be adequate, you may need to rely on statistically-significant +confidence levels of sample testing at certain points. I will +demonstrate both here. For statistically-significant confidence levels, +most people will assume a 90% - 99% confidence level. In order to +actually calculate the correct sample size, it is best to use +statistical tools due to the tedious math work required. For example, +for a population of 1000, and a 90% confidence level that no more than +5% of the items are nonconforming, you would sample 45 items. + +However, in my personal experience, many audit departments and firms do +not use statistical sampling. Most people use a predetermined, often +proprietary, table that will instruct auditors which sample sizes to +choose. This allows for uniform testing and reduces overall workload. +See the table below for a common implementation of sample sizes: | Control Frequency | Sample Size - High Risk | Sample Size - Low Risk | -|-------------------|-------------------------|------------------------| +|-------------------+-------------------------+------------------------| | More Than Daily | 40 | 25 | | Daily | 40 | 25 | | Weekly | 12 | 5 | @@ -106,23 +109,27 @@ implementation of sample sizes: | Annually | 1 | 1 | | Ad-hoc | 1 | 1 | -## Sampling with Python & Pandas - -In this section, I am going to cover a few basic audit situations that require -sampling. While some situations may require more effort, the syntax, -organization, and intellect used remain largely the same. If you've never used -Python before, note that lines starting with a '`#`' symbol are called comments, -and they will be skipped by Python. 
I highly recommend taking a quick tutorial -online to understand the basics of Python if any of the code below is confusing -to you. - -### Simple Random Sample - -First, let's look at a simple, random sample. The code block below will import -the `pandas` module, load a data file, sample the data, and export the sample to -a file. - -```python +** Sampling with Python & Pandas +:PROPERTIES: +:CUSTOM_ID: sampling-with-python-pandas +:END: +In this section, I am going to cover a few basic audit situations that +require sampling. While some situations may require more effort, the +syntax, organization, and intellect used remain largely the same. If +you've never used Python before, note that lines starting with a '=#=' +symbol are called comments, and they will be skipped by Python. I highly +recommend taking a quick tutorial online to understand the basics of +Python if any of the code below is confusing to you. + +*** Simple Random Sample +:PROPERTIES: +:CUSTOM_ID: simple-random-sample +:END: +First, let's look at a simple, random sample. The code block below will +import the =pandas= module, load a data file, sample the data, and +export the sample to a file. + +#+begin_src python # Import the Pandas module import pandas @@ -140,14 +147,16 @@ sample = df.sample(n=25, random_state=0) # Save the sample to Excel sample.to_excel(file_output) -``` +#+end_src -### Simple Random Sample: Using Multiple Input Files +*** Simple Random Sample: Using Multiple Input Files +:PROPERTIES: +:CUSTOM_ID: simple-random-sample-using-multiple-input-files +:END: +Now that we've created a simple sample, let's create a sample from +multiple files. -Now that we've created a simple sample, let's create a sample from multiple -files. 
- -```python +#+begin_src python # Import the Pandas module import pandas @@ -174,15 +183,17 @@ sample = pandas.concat([sample_01, sample_02, sample_03], ignore_index=True) # Save the sample to Excel sample.to_excel(file_output) -``` - -### Stratified Random Sample +#+end_src -Well, what if you need to sample distinct parts of a single file? For example, -let's write some code to separate our data by "Region" and sample those regions -independently. +*** Stratified Random Sample +:PROPERTIES: +:CUSTOM_ID: stratified-random-sample +:END: +Well, what if you need to sample distinct parts of a single file? For +example, let's write some code to separate our data by "Region" and +sample those regions independently. -```python +#+begin_src python # Import the Pandas module import pandas @@ -208,16 +219,18 @@ sample = pandas.concat([sample_east, sample_west], ignore_index=True) # Save the sample to Excel sample.to_excel(file_output) -``` - -### Stratified Systematic Sample - -This next example is quite useful if you need audit coverage over a certain time -period. This code will generate samples for each month in the data and combine -them all together at the end. Obviously, this code can be modified to stratify -by something other than months, if needed. - -```python +#+end_src + +*** Stratified Systematic Sample +:PROPERTIES: +:CUSTOM_ID: stratified-systematic-sample +:END: +This next example is quite useful if you need audit coverage over a +certain time period. This code will generate samples for each month in +the data and combine them all together at the end. Obviously, this code +can be modified to stratify by something other than months, if needed. 
+ +#+begin_src python # Import the Pandas module import pandas @@ -257,21 +270,23 @@ def monthly_stratified_sample(df: pandas.DataFrame, date_column: str, num_select sample_size = 3 sample = monthly_stratified_sample(df, 'Date of Sale', sample_size) sample.to_excel(file_output) -``` - -## Documenting the Results - -Once you've generated a proper sample, there are a few things left to do in -order to properly ensure your process is reproducible. - -1. Document the sample. Make sure the resulting file is readable and includes - the documentation listed in the next bullet. -2. Include documentation around the data source, extraction techniques, any - modifications made to the data, and be sure to include a copy of the script - itself. -3. Whenever possible, perform a completeness and accuracy test to ensure your - sample is coming from a complete and accurate population. To ensure - completeness, compare the record count from the data source to the record - count loaded into Python. To ensure accuracy, test a small sample against the - source data (e.g., test 5 sales against the database to see if the details are - accurate). +#+end_src + +** Documenting the Results +:PROPERTIES: +:CUSTOM_ID: documenting-the-results +:END: +Once you've generated a proper sample, there are a few things left to do +in order to properly ensure your process is reproducible. + +1. Document the sample. Make sure the resulting file is readable and + includes the documentation listed in the next bullet. +2. Include documentation around the data source, extraction techniques, + any modifications made to the data, and be sure to include a copy of + the script itself. +3. Whenever possible, perform a completeness and accuracy test to ensure + your sample is coming from a complete and accurate population. To + ensure completeness, compare the record count from the data source to + the record count loaded into Python. 
To ensure accuracy, test a small + sample against the source data (e.g., test 5 sales against the + database to see if the details are accurate).
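The completeness and accuracy step described in the last item of the post can be sketched roughly as follows. This is only an illustrative sketch, not part of the commit above: the `check_completeness` helper, the `Sale ID` and `Amount` columns, and the `source_record_count` value are all hypothetical stand-ins for whatever the real data source reports.

```python
import pandas


def check_completeness(df: pandas.DataFrame, expected_count: int) -> bool:
    """Return True when the number of loaded rows matches the source count."""
    # Hypothetical helper: compares rows loaded into Python against the
    # record count reported by the source system.
    return len(df) == expected_count


# Hypothetical population standing in for the data file loaded with pandas
population = pandas.DataFrame({
    "Sale ID": range(1, 101),
    "Amount": [10.0 * i for i in range(1, 101)],
})

# Hypothetical record count taken from the source system (e.g., a database)
source_record_count = 100

# Completeness: the loaded record count should equal the source record count
is_complete = check_completeness(population, source_record_count)

# Accuracy: draw a small, reproducible sample to trace back to the source
accuracy_sample = population.sample(n=5, random_state=0)
```

Keeping the `random_state` fixed means a reviewer who reruns the script should get the same five records to trace, provided the underlying data has not changed.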