diff --git a/_config.yml b/_config.yml index 2ec03d3a8..8d159fdfd 100644 --- a/_config.yml +++ b/_config.yml @@ -92,10 +92,12 @@ extras_order: - figures - guide - common-issues - - discuss + - refactor-1-software-design + - refactor-2-code-refactoring + - refactor-3-code-abstractions + - refactor-4-architecture-revisited - protect-main-branch - vscode - - functional-programming - persistence - databases - geopandas diff --git a/_extras/refactor-1-software-design.md b/_extras/refactor-1-software-design.md new file mode 100644 index 000000000..4941bb4a3 --- /dev/null +++ b/_extras/refactor-1-software-design.md @@ -0,0 +1,241 @@ +--- +title: "Refactor 1: Software Design" +teaching: 25 +exercises: 20 +questions: +- "Why should we invest time in software design?" +- "What should we consider when designing software?" +objectives: +- "Understand the goals and principles of designing 'good' software." +- "Understand code decoupling and code abstraction design techniques." +- "Understand what code refactoring is." +keypoints: +- "'Good' code is designed to be maintainable: readable by people who did not author the code, +testable through a set of automated tests, adaptable to new requirements." +- "The sooner you adopt a practice of designing your software in the lifecycle of your project, +the easier the development and maintenance process will." +--- + +## Introduction + +Ideally, we should have at least a rough design of our software sketched out +before we write a single line of code. +This design should be based around the requirements and the structure of the problem we are trying +to solve: what are the concepts we need to represent in our code +and what are the relationships between them. +And importantly, who will be using our software and how will they interact with it. + +As a piece of software grows, +it will reach a point where there is too much code for us to keep in mind at once. +At this point, it becomes particularly important to think of the overall design and +structure of our software, how should all the pieces of functionality fit together, +and how should we work towards fulfilling this overall design throughout development. +Even if you did not think about the design of your software from the very beginning - +it is not too late to start now. + +It is not easy to come up with a complete definition for the term **software design**, +but some of the common aspects are: + +- **Algorithm design** - + what method are we going to use to solve the core research/business problem? +- **Software architecture** - + what components will the software have and how will they cooperate? +- **System architecture** - + what other things will this software have to interact with and how will it do this? +- **UI/UX** (User Interface / User Experience) - + how will users interact with the software? + +There is literature on each of the above software design aspects - we will not go into details of +them all here. +Instead, we will learn some techniques to structure our code better to satisfy some of the +requirements of 'good' software and revisit +our software's [MVC architecture](/11-software-project/index.html#software-architecture) +in the context of software design. 
+
+## Good Software Design Goals
+Aspirationally, what makes good code can be summarised in the following quote from the
+[Intent HQ blog](https://intenthq.com/blog/it-audience/what-is-good-code-a-scientific-definition/):
+
+> *“Good code is written so that is readable, understandable,
+> covered by automated tests, not over complicated
+> and does well what is intended to do.”*
+
+Software has become a crucial aspect of reproducible research, as well as an asset that
+can be reused or repurposed.
+Thus, it is even more important to take time to design the software to be easily *modifiable* and
+*extensible*, to save ourselves and our team a lot of time later on when we have
+to fix a problem or the software's requirements change.
+
+Satisfying the above properties will lead to an overall software design
+goal of having *maintainable* code, which is:
+
+* *readable* (and understandable) by developers who did not write the code, e.g. by:
+  * following a consistent coding style and naming conventions
+  * using meaningful and descriptive names for variables, functions, and classes
+  * documenting code to describe what it does and how it may be used
+  * using simple control flow to make it easier to follow the code execution
+  * keeping functions and methods small and focused on a single task (also important for testing)
+* *testable* through a set of (preferably automated) tests, e.g. by:
+  * writing unit, functional, regression tests to verify the code produces
+    the expected outputs from controlled inputs and exhibits the expected behavior over time
+    as the code changes
+* *adaptable* (easily modifiable and extensible) to satisfy new requirements, e.g. by:
+  * writing low-coupled/decoupled code where each part of the code has a separate concern and
+    the lowest possible dependency on other parts of the code, making it
+    easier to test, update or replace - e.g. by separating the "business logic" and "presentation"
+    layers of the code on the architecture level (recall the [MVC architecture](/11-software-project/index.html#software-architecture)),
+    or separating "pure" (without side-effects) and "impure" (with side-effects) parts of the code on the
+    level of functions.
+
+Now that we know what goals we should aspire to, let us take a critical look at the code in our
+software project and try to identify ways in which it can be improved.
+
+Our software project contains a branch `full-data-analysis` with code for a new feature of our
+catchment analysis software. Recall that you can see all your branches as follows:
+~~~
+$ git branch --all
+~~~
+{: .language-bash}
+
+Let's checkout a new local branch from the `full-data-analysis` branch, making sure we
+have saved and committed all current changes before doing so.
+
+~~~
+git checkout -b full-data-analysis origin/full-data-analysis
+~~~
+{: .language-bash}
+
+This new feature enables the user to pass a new command-line parameter `--full-data-analysis` causing
+the software to find the directory containing the first input data file (provided via the command line
+parameter `infiles`) and invoke the data analysis over all the data files in that directory.
+This bit of functionality is handled by `catchment-analysis.py` in the project root, e.g.:
+```bash
+python catchment-analysis.py data/rain_data_small.csv --full-data-analysis
+```
+
+The new data analysis code is located in the `compute_data.py` file within the `catchment` directory
+in a function called `analyse_data()`.
+
+This function loads all the data files for a given directory path, then
+calculates and compares standard deviation across all the data by day and finally plots a graph.
+
+> ## Exercise: Identifying How Code Can be Improved?
+> Critically examine the code in the `analyse_data()` function in the `compute_data.py` file.
+>
+> In what ways does this code not live up to the ideal properties of 'good' code?
+> Think about ways in which you find it hard to understand.
+> Think about the kinds of changes you might want to make to it, and what would
+> make making those changes challenging.
+>> ## Solution
+>> You may have found others, but here are some of the things that make the code
+>> hard to read, test and maintain.
+>>
+>> * **Hard to read:** everything is implemented in a single function.
+>>   In order to understand it, you need to understand how file loading works at the same time as
+>>   the analysis itself.
+>> * **Hard to modify:** if you wanted to use the data for some other purpose and not just
+>>   plotting the graph you would have to change the `analyse_data()` function.
+>> * **Hard to modify or test:** it always analyses a fixed set of CSV data files
+>>   within whichever directory it accesses, not always the file that is given as an argument.
+>> * **Hard to modify:** it does not have any tests so we cannot be 100% confident the code does
+>>   what it claims to do; any changes to the code may break something and it would be harder and
+>>   more time-consuming to figure out what.
+>>
+>> Make sure to keep the list you have created in the exercise above.
+>> For the remainder of this section, we will work on improving this code.
+>> At the end, we will revisit your list to check that you have learnt ways to address each of the
+>> problems you had found.
+>>
+>> There may be other things to improve with the code on this branch, e.g. how command line
+>> parameters are being handled in `catchment-analysis.py`, but we are focussing on
+>> the `analyse_data()` function for the time being.
+> {: .solution}
+{: .challenge}
+
+## Poor Design Choices & Technical Debt
+
+When faced with a problem that you need to solve by writing code - it may be tempting to
+skip the design phase and dive straight into coding.
+What happens if you do not follow good software design and development best practices?
+It can lead to accumulated 'technical debt',
+which (according to [Wikipedia](https://en.wikipedia.org/wiki/Technical_debt)),
+is the "cost of additional rework caused by choosing an easy (limited) solution now
+instead of using a better approach that would take longer".
+The pressure to achieve project goals can sometimes lead to quick and easy solutions,
+which make the software messier, more complex, and more difficult to understand and maintain.
+The extra effort required to make changes in the future is the interest paid on the (technical) debt.
+It is natural for software to accrue some technical debt,
+but it is important to pay off that debt during a maintenance phase -
+simplifying, clarifying the code, making it easier to understand -
+to keep the interest payments on making changes manageable.
+
+There is only so much time available in a project.
+How much effort should we spend on designing our code properly
+and using good development practices?
+The following [XKCD comic](https://xkcd.com/844/) summarises this tension: + +![Writing good code comic](../fig/xkcd-good-code-comic.png){: .image-with-shadow width="400px" } + +At an intermediate level there are a wealth of practices that *could* be used, +and applying suitable design and coding practices is what separates +an *intermediate developer* from someone who has just started coding. +The key for an intermediate developer is to balance these concerns +for each software project appropriately, +and employ design and development practices *enough* so that progress can be made. +It is very easy to under-design software, +but remember it is also possible to over-design software too. + +## Techniques for Improving Code + +How code is structured is important for helping people who are developing and maintaining it +to understand and update it. +By breaking down our software into components with a single responsibility, +we avoid having to rewrite it all when requirements change. +Such components can be as small as a single function, or be a software package in their own right. +These smaller components can be understood individually without having to understand +the entire codebase at once. + +### Code Refactoring + +*Code refactoring* is the process of improving the design of an existing code - +changing the internal structure of code without changing its +external behavior, with the goal of making the code more readable, maintainable, efficient or easier +to test. +This can include things such as renaming variables, reorganising +functions to avoid code duplication and increase reuse, and simplifying conditional statements. + +### Code Decoupling + +*Code decoupling* is a code design technique that involves breaking a (complex) +software system into smaller, more manageable parts, and reducing the interdependence +between these different parts of the system. +This means that a change in one part of the code usually does not require a change in the other, +thereby making its development more efficient and less error prone. + +### Code Abstraction + +*Abstraction* is the process of hiding the implementation details of a piece of +code (typically behind an interface) - i.e. the details of *how* something works are hidden away, +leaving code developers to deal only with *what* it does. +This allows developers to work with the code at a higher level +of abstraction, without needing to understand fully (or keep in mind) all the underlying +details at any given time and thereby reducing the cognitive load when programming. + +Abstraction can be achieved through techniques such as *encapsulation*, *inheritance*, and +*polymorphism*, which we will explore in the next episodes. There are other [abstraction techniques](https://en.wikipedia.org/wiki/Abstraction_(computer_science)) +available too. + +## Improving Our Software Design + +Refactoring our code to make it more decoupled and to introduce abstractions to +hide all but the relevant information about parts of the code is important for creating more +maintainable code. +It will help to keep our codebase clean, modular and easier to understand. + +Writing good code is hard and takes practise. +You may also be faced with an existing piece of code that breaks some (or all) of the +good code principles, and your job will be to improve/refactor it so that it can evolve further. +We will now look into some examples of the techniques that can help us redesign our code +and incrementally improve its quality. 
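+
+As a small, generic illustration of where these techniques lead (the functions below are made up
+for illustration and are not taken from our project), consider a function that mixes file handling
+with a calculation:
+
+```python
+# Before: reading the file and computing the result are tangled together,
+# so the calculation cannot be tested without a real file on disk.
+def report_total(filename):
+    with open(filename) as file:
+        values = [float(line) for line in file]
+    print(sum(values))
+```
+
+After refactoring, the calculation lives in its own small "pure" function,
+and the file handling becomes a thin wrapper around it:
+
+```python
+# After: compute_total() is decoupled from any file handling and can be
+# tested with a plain list of numbers.
+def compute_total(values):
+    return sum(values)
+
+
+def report_total(filename):
+    with open(filename) as file:
+        values = [float(line) for line in file]
+    print(compute_total(values))
+```
+
+The refactored version is slightly longer, but the calculation can now be tested without touching
+the file system, and the file-reading code can change (or be replaced by a different data source)
+without affecting the calculation - exactly the kind of decoupling we will apply to
+`analyse_data()` in the following episodes.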
+
+{% include links.md %}
diff --git a/_extras/refactor-2-code-refactoring.md b/_extras/refactor-2-code-refactoring.md
new file mode 100644
index 000000000..7ad37673f
--- /dev/null
+++ b/_extras/refactor-2-code-refactoring.md
@@ -0,0 +1,409 @@
+---
+title: "Refactor 2: Code Refactoring"
+teaching: 30
+exercises: 20
+questions:
+- "How do you refactor code without breaking it?"
+- "What is decoupled code?"
+- "What are the benefits of using pure functions in our code?"
+objectives:
+- "Understand the benefits of code decoupling."
+- "Understand the use of regression tests to avoid breaking existing code when refactoring."
+- "Understand the use of pure functions in software design to make the code easier to test."
+- "Refactor a piece of code to separate out 'pure' from 'impure' code."
+keypoints:
+- "Implementing regression tests before refactoring gives you confidence that your changes have not
+broken the code."
+- "Decoupling code into pure functions that process data without side effects makes code easier
+to read, test and maintain."
+---
+
+## Introduction
+
+*Code refactoring* is the process of improving the design of existing code - for example
+to make it more decoupled.
+Recall that *code decoupling* means breaking the system into smaller components and reducing the
+interdependence between these components, so that they can be tested and maintained independently.
+Two components of code can be considered **decoupled** if a change in one does not
+necessitate a change in the other.
+While two connected units cannot always be totally decoupled, **loose coupling**
+is something we should aim for. Benefits of decoupled code include:
+
+* easier to read, as you do not need to understand the
+  details of the other component.
+* easier to test, as one of the components can be replaced
+  by a test or a mock version of it.
+* easier to maintain, as changes can be isolated
+  from other parts of the code.
+
+When faced with an existing piece of code that needs modifying, a good refactoring
+process to follow is:
+
+1. Make sure you have tests that verify the current behaviour
+2. Refactor the code
+3. Verify that the behaviour of the code is identical to that before refactoring.
+
+In this episode we will refactor the function `analyse_data()` in `compute_data.py`
+from our project in the following two ways:
+* add more tests so we can be more confident that future changes will have the
+intended effect and will not break the existing code.
+* split the monolithic `analyse_data()` function into a number of smaller and more decoupled functions
+making the code easier to understand and test.
+
+## Writing Tests Before Refactoring
+
+When refactoring, first we need to make sure there are tests that verify
+the code behaviour as it is now (or write them if they are missing),
+then refactor the code and, finally, check that the original tests still pass.
+This is to make sure we do not break the existing behaviour through refactoring.
+
+There is a bit of a "chicken and egg" problem here - if the refactoring is supposed to make it easier
+to write tests in the future, how can we write tests before doing the refactoring?
+The tricks to get around this trap are:
+
+ * Test at a higher level, with coarser accuracy
+ * Write tests that you intend to remove
+
+The best tests are ones that test single bits of functionality rigorously.
+However, with our current `analyse_data()` code that is not possible because it is a
+large function doing a little bit of everything.
+Instead we will make minimal changes to the code to make it a bit more testable. + +Firstly, +we will modify the function to return the data instead of visualising it because graphs are harder +to test automatically (i.e. they need to be viewed and inspected manually in order to determine +their correctness). +Next, we will make the assert statements verify what the outcome is +currently, rather than checking whether that is correct or not. +Such tests are meant to +verify that the behaviour does not *change* rather than checking the current behaviour is correct +(there should be another set of tests checking the correctness). +This kind of testing is called **regression testing** as we are testing for +regressions in existing behaviour. + +Refactoring code is not meant to change its behaviour, but sometimes to make it possible to verify +you not changing the important behaviour you have to make small tweaks to the code to write +the tests at all. + +> ## Exercise: Write Regression Tests +> Modify the `analyse_data()` function not to plot a graph and return the data instead. +> Then, add a new test file called `test_compute_data.py` in the `tests` folder and +> add a regression test to verify the current output of `analyse_data()`. We will use this test +> in the remainder of this section to verify the output `analyse_data()` is unchanged each time +> we refactor or change code in the future. +> +> Start from the skeleton test code below: +> +> ```python +> def test_analyse_data(): +> from catchment.compute_data import analyse_data +> path = Path.cwd() / "data" +> result = analyse_data(path) +> +> # TODO: add an assert for the value of result +> ``` +> Use `assert_array_almost_equal` from the `numpy.testing` library to +> compare arrays of floating point numbers. +> +> Remember to run the test using `python -m pytest` from the project base directory: +> ```bash +> python -m pytest tests/test_compute_data.py +> ``` +> +>> ## Hint +>> When determining the correct return data result to use in tests, it may be helpful to assert the +>> result equals some random made-up data, observe the test fail initially and then +>> copy and paste the correct result into the test. +>> +>> Remember also that NaN values can be defined using the numpy library (`numpy.nan`). +> {: .solution} +> +>> ## Solution +>> One approach we can take is to: +>> * comment out the visualize method on `analyse_data()` +>> (as this will cause our test to hang waiting for the result data) +>> * return the data instead, so we can write asserts on the data +>> * See what the calculated value is, and assert that it is the same as the expected value +>> +>> Putting this together, your test may look like: +>> +>> ```python +>> import numpy as np +>> import numpy.testing as npt +>> from pathlib import Path +>> +>> def test_analyse_data(): +>> from catchment.compute_data import analyse_data +>> path = Path.cwd() / "data" +>> result = analyse_data(path) +>> expected_output = [ [0. , 0.18801829], +>> [0.10978448, 0.43107373], +>> [0.06066156, 0.0699624 ], +>> [0. , 0.02041241], +>> [0. , 0. ], +>> [0. , 0.02871518], +>> [0. , 0.17227833], +>> [0. , 0.04866643], +>> [0. , 0.02041241], +>> [0.88952727, 0. ], +>> [0. , 0.02041241], +>> [0. , 0. ], +>> [0.02041241, 0. ], +>> [0. , 0. ], +>> [0. , 0. ], +>> [0. , 0. ], +>> [0. , 0. ], +>> [0.0349812 , 0.02041241], +>> [0.02871518, 0.02041241], +>> [0.02041241, 0. ], +>> [0.02041241, 0. ], +>> [0. , 0.02041241], +>> [0. , 0. ], +>> [0. , np.nan], +>> [0.02041241, 0. ], +>> [0. 
, 0.02041241], +>> [0. , 0.02041241], +>> [0.02041241, 0. ], +>> [0.13449059, 0. ], +>> [0.18285024, 0.19707288], +>> [0.19176008, 0.13915472]] +>> npt.assert_array_almost_equal(result, expected_output) +>> ``` +>> +>> Note that while the above test will detect if we accidentally break the analysis code and +>> change the output of the analysis, is not a good or complete test for the following reasons: +>> * It is not at all obvious why the `expected_output` is correct +>> * It does not test edge cases +>> * If the data files in the directory change - the test will fail +>> +>> We would need additional tests to check the above. +> {: .solution} +{: .challenge} + +## Separating Pure and Impure Code + +Now that we have our regression test for `analyse_data()` in place, we are ready to refactor the +function further. +We would like to separate out as much of its code as possible as *pure functions*. +Pure functions are very useful and much easier to test as they take input only from its input +parameters and output only via their return values. + +### Pure Functions + +A pure function in programming works like a mathematical function - +it takes in some input and produces an output and that output is +always the same for the same input. +That is, the output of a pure function does not depend on any information +which is not present in the input (such as global variables). +Furthermore, pure functions do not cause any *side effects* - they do not modify the input data +or data that exist outside the function (such as printing text, writing to a file or +changing a global variable). They perform actions that affect nothing but the value they return. + +### Benefits of Pure Functions + +Pure functions are easier to understand because they eliminate side effects. +The reader only needs to concern themselves with the input +parameters of the function and the function code itself, rather than +the overall context the function is operating in. +Similarly, a function that calls a pure function is also easier +to understand - we only need to understand what the function returns, which will probably +be clear from the context in which the function is called. +Finally, pure functions are easier to reuse as the caller +only needs to understand what parameters to provide, rather +than anything else that might need to be configured prior to the call. +For these reasons, you should try and have as much of the complex, analytical and mathematical +code are pure functions. + + +Some parts of a program are inevitably impure. +Programs need to read input from users, generate a graph, or write results to a file or a database. +Well designed programs separate complex logic from the necessary impure "glue" code that +interacts with users and other systems. +This way, you have easy-to-read and easy-to-test pure code that contains the complex logic +and simplified impure code that reads data from a file or gathers user input. Impure code may +be harder to test but, when simplified like this, may only require a handful of tests anyway. + +> ## Exercise: Refactoring To Use a Pure Function +> Refactor the `analyse_data()` function to delegate the data analysis to a new +> pure function `compute_standard_deviation_by_day()` and separate it +> from the impure code that handles the input and output. 
+> The pure function should take in the data, and return the analysis result, as follows: +> ```python +> def compute_standard_deviation_by_day(data): +> # TODO +> return daily_standard_deviation +> ``` +>> ## Solution +>> The analysis code will be refactored into a separate function that may look something like: +>> ```python +>>def compute_standard_deviation_by_day(data): +>> daily_std_list = [] +>> for dataset in data: +>> daily_std = dataset.groupby(dataset.index.date).std() +>> daily_std_list.append(daily_std) +>> +>> daily_standard_deviation = pd.concat(daily_std_list) +>> return daily_standard_deviation +>> ``` +>> The `analyse_data()` function now calls the `compute_standard_deviation_by_day()` function, +>> while keeping all the logic for reading the data, processing it and showing it in a graph: +>>```python +>>def analyse_data(data_dir): +>> """Calculate the standard deviation by day between datasets. +>> +>> Gets all the measurement data from the CSV files in the data directory, +>> works out the mean for each day, and then graphs the standard deviation +>> of these means. +>> """ +>> data_file_paths = glob.glob(os.path.join(data_dir, 'rain_data_2015*.csv')) +>> if len(data_file_paths) == 0: +>> raise ValueError('No CSV files found in the data directory') +>> data = map(models.read_variable_from_csv, data_file_paths) +>> daily_standard_deviation = compute_standard_deviation_by_day(data) +>> +>> graph_data = { +>> 'standard deviation by day': daily_standard_deviation, +>> } +>> # views.visualize(graph_data) +>> return daily_standard_deviation +>>``` +>> Make sure to re-run the regression test to check this refactoring has not +>> changed the output of `analyse_data()`. +> {: .solution} +{: .challenge} + +> ## Mapping +> `map(f, C)` is a function that takes another function `f()` +> and a collection `C` of data items as inputs. +> Calling `map(f, C)` applies the function `f(x)` to every data item `x` in a collection `C` +> and returns the resulting values as a new collection of the same size. +> +> This is a simple mapping that takes a list of names and +> returns a list of the lengths of those names using the built-in function `len()`: +> ```python +> name_lengths = map(len, ["Mary", "Isla", "Sam"]) +> print(list(name_lengths)) +> ``` +> ```output +> [4, 4, 3] +> ``` +> For more information on mapping functions, and how they can be combined with reduce +> functions, see the [Functional Programming](/34-functional-programming/index.html) episode. +{: .callout} + +> ## Exercise: Mapping +> Identify a line of code in the `analyse_data` function which uses the `map` function. +>> ## Solution +>> The `map` function is used with the `read_variables_from_csv` function in the `catchment/models.py` module. +>> It creates a collection of dataframes containing the data within files defined in the list `data_file_paths`: +>> ```python +>> data = map(models.read_variable_from_csv, data_file_paths) +>> ``` +> {: .solution} +> +> Now create a pure function, `daily_std`, to calculate the standard deviation by day for any dataframe. +> This can take a similar form to the `daily_mean` and `daily_max` functions in the `catchment/models.py` file. +> +> Then replace the `for` loop below, that is in your `compute_standard_deviation_by_day` function, +> with a `map()` function that uses the `daily_std` function to calculate the daily standard +> deviation. 
+> ```python +> daily_std_list = [] +> for dataset in data: +> daily_std = dataset.groupby(dataset.index.date).std() +> daily_std_list.append(daily_std) +> ``` +>> ## Solution +>> The final functions could look like: +>> ```python +>> def daily_std(data): +>> return data.groupby(data.index.date).std() +>> +>> +>> def compute_standard_deviation_by_day(data): +>> daily_std_list = map(daily_std, data) +>> +>> daily_standard_deviation = pd.concat(daily_std_list) +>> return daily_standard_deviation +>> ``` +>> +> {: .solution} +{: .challenge} + +### Testing Pure Functions + +Now we have our analysis implemented as a pure function, we can write tests that cover +all the things we would like to check without depending on CSVs files. +This is another advantage of pure functions - they are very well suited to automated testing, +i.e. their tests are: +* **easier to write** - we construct input and assert the output +without having to think about making sure the global state is correct before or after +* **easier to read** - the reader will not have to open a CSV file to understand why +the test is correct +* **easier to maintain** - if at some point the data format changes +from CSV to JSON, the bulk of the tests need not be updated + +> ## Exercise: Testing a Pure Function +> Add tests for `compute_standard_deviation_by_day()` that check for situations +> when there is only one file with multiple sites, +> multiple files with one site, and any other cases you can think of that should be tested. +>> ## Solution +>> You might have thought of more tests, but we can easily extend the test by parametrizing +>> with more inputs and expected outputs: +>> ```python +>>@pytest.mark.parametrize( +>> "data, expected_output", +>> [ +>> ( +>> [pd.DataFrame(data=[ [1.0, 0.0], [3.0, 4.0], [5.0, 8.0] ], +>> index=[ pd.to_datetime('2000-01-01 01:00'), +>> pd.to_datetime('2000-01-01 02:00'), +>> pd.to_datetime('2000-01-01 03:00') ], +>> columns=[ 'A', 'B' ])], +>> [ [2.0, 4.0] ] +>> ), +>> ( +>> [pd.DataFrame(data=[ 1.0, 3.0, 5.0 ], +>> index=[ pd.to_datetime('2000-01-01 01:00'), +>> pd.to_datetime('2000-01-01 02:00'), +>> pd.to_datetime('2000-01-01 03:00') ], +>> columns=['A']), +>> pd.DataFrame(data=[ 0.0, 4.0, 8.0 ], +>> index=[ pd.to_datetime('2000-01-01 01:00'), +>> pd.to_datetime('2000-01-01 02:00'), +>> pd.to_datetime('2000-01-01 03:00') ], +>> columns=['B']) ], +>> [ [2.0, 4.0] ] +>> ) +>> ], ids=["two datasets in same dataframe", "two datasets in two different dataframes"]) +>>def test_compute_standard_deviation_by_day(data, expected_output): +>> from catchment.compute_data import compute_standard_deviation_by_day +>> +>> result = compute_standard_deviation_by_day(data) +>> npt.assert_array_almost_equal(result, expected_output) +``` +> {: .solution} +{: .challenge} + +> ## Functional Programming +> **Functional programming** is a programming paradigm where programs are constructed by +> applying and composing/chaining pure functions. +> Some programming languages, such as Haskell or Lisp, support writing pure functional code only. +> Other languages, such as Python, Java, C++, allow mixing **functional** and **procedural** +> programming paradigms. +> Read more in the [extra episode on functional programming](/34-functional-programming/index.html) +> and when it can be very useful to switch to this paradigm +> (e.g. to employ MapReduce approach for data processing). 
+{: .callout} + + +There are no definite rules in software design but making your complex logic out of +composed pure functions is a great place to start when trying to make your code readable, +testable and maintainable. This is particularly useful for: + +* Data processing and analysis +(for example, using [Python Pandas library](https://pandas.pydata.org/) for data manipulation where most of functions appear pure) +* Doing simulations +* Translating data from one format to another + +{% include links.md %} diff --git a/_extras/refactor-3-code-abstractions.md b/_extras/refactor-3-code-abstractions.md new file mode 100644 index 000000000..4a3996256 --- /dev/null +++ b/_extras/refactor-3-code-abstractions.md @@ -0,0 +1,482 @@ +--- +title: "Refactor 3: Code Abstractions" +teaching: 30 +exercises: 45 +questions: +- "When is it useful to use classes to structure code?" +- "How can we make sure the components of our software are reusable?" +objectives: +- "Introduce appropriate abstractions to simplify code." +- "Understand the principles of encapsulation, polymorphism and interfaces." +- "Use mocks to replace a class in test code." +keypoints: +- "Classes and interfaces can help decouple code so it is easier to understand, test and maintain." +- "Encapsulation is bundling related data into a structured component, +along with the methods that operate on the data. It is also provides a mechanism for restricting +the access to that data, hiding the internal representation of the component." +- "Polymorphism describes the provision of a single interface to entities of different types, +or the use of a single symbol to represent different types." +--- + +## Introduction + +*Code abstraction* is the process of hiding the implementation details of a piece of +code behind an interface - i.e. the details of *how* something works are hidden away, +leaving us to deal only with *what* it does. +This allows developers to work with the code at a higher level +of abstraction, without needing to understand fully (or keep in mind) all the underlying +details and thereby reducing the cognitive load when programming. + +Abstractions can aid decoupling of code. +If one part of the code only uses another part through an appropriate abstraction +then it becomes easier for these parts to change independently. + +Let's start redesigning our code by introducing some of the abstraction techniques +to incrementally improve its design. + +You may have noticed that loading data from CSV files in a directory is "baked" into +(i.e. is part of) the `analyse_data()` function. +This is not strictly a functionality of the data analysis function, so firstly +let's decouple the data loading into a separate function. + +> ## Exercise: Decouple Data Loading from Data Analysis +> Separate out the data loading functionality from `analyse_data()` into a new function +> `load_catchment_data()` that returns all the files to load. 
+>> ## Solution
+>> The new function `load_catchment_data()` that reads all the data into the format needed
+>> for the analysis should look something like:
+>> ```python
+>> def load_catchment_data(dir_path):
+>>     data_file_paths = glob.glob(os.path.join(dir_path, 'rain_data_2015*.csv'))
+>>     if len(data_file_paths) == 0:
+>>         raise ValueError('No CSV files found in the data directory')
+>>     data = map(models.read_variable_from_csv, data_file_paths)
+>>     return list(data)
+>> ```
+>> This function can now be used in the analysis as follows:
+>> ```python
+>> def analyse_data(data_dir):
+>>     data = load_catchment_data(data_dir)
+>>     daily_standard_deviation = compute_standard_deviation_by_day(data)
+>>     ...
+>> ```
+>> The code is now easier to follow since we do not need to understand the data loading from
+>> files to read the statistical analysis, and vice versa - we do not have to understand the
+>> statistical analysis when looking at data loading.
+>> Ensure you re-run the regression tests to check this refactoring has not
+>> changed the output of `analyse_data()`.
+> {: .solution}
+{: .challenge}
+
+However, even with this change, the data loading is still coupled with the data analysis.
+For example, if we have to support loading data from different sources
+(e.g. JSON files and CSV files), we would have to pass some kind of a flag indicating
+what we want into `analyse_data()`. Instead, we would like to decouple the
+consideration of what data to load from the `analyse_data()` function entirely.
+One way we can do this is by using *encapsulation* and *classes*.
+
+## Encapsulation & Classes
+
+*Encapsulation* is the packing of "data" and "functions operating on that data" into a
+single component/object.
+It also provides a mechanism for restricting the access to that data.
+Encapsulation means that the internal representation of a component is generally hidden
+from view outside of the component's definition.
+
+Encapsulation allows developers to present a consistent interface to an object/component
+that is independent of its internal implementation.
+For example, encapsulation can be used to hide the values or
+state of a structured data object inside a **class**, preventing direct access to them
+that could violate the object's state maintained by the class' methods.
+Note that object-oriented programming (OOP) languages support encapsulation,
+but encapsulation is not unique to OOP.
+
+So, a class is a way of grouping together data with some methods that manipulate that data.
+In Python, you can *declare* a class as follows:
+
+```python
+class Circle:
+    pass
+```
+
+Classes are typically named using the "CapitalisedWords" naming convention - e.g. FileReader,
+OutputStream, Rectangle.
+
+You can *construct* an *instance* of a class elsewhere in the code by doing the following:
+
+```python
+my_circle = Circle()
+```
+
+When you construct a class in this way, the class' *constructor* method is called.
+It is also possible to pass values to the constructor in order to configure the class instance:
+
+```python
+class Circle:
+    def __init__(self, radius):
+        self.radius = radius
+
+my_circle = Circle(10)
+```
+
+The constructor has the special name `__init__`.
+Note it has a special first parameter called `self` by convention - it is
+used to access the current *instance* of the object being created.
+
+A class can be thought of as a cookie cutter template, and instances as the cookies themselves.
+That is, one class can have many instances.
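+
+For example, using the `Circle` class defined above, we can create as many instances as we need,
+each configured through the constructor:
+
+```python
+# Two independent instances created from the same Circle class 'template';
+# each instance stores its own radius value passed to the constructor
+small_circle = Circle(3)
+large_circle = Circle(300)
+```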
+
+Classes can also have other methods defined on them.
+Like constructors, they have the special parameter `self` that must come first.
+
+```python
+import math
+
+class Circle:
+    ...
+    def get_area(self):
+        return math.pi * self.radius * self.radius
+...
+print(my_circle.get_area())
+```
+
+On the last line of the code above, the instance of the class, `my_circle`, will be automatically
+passed as the first parameter (`self`) when calling the `get_area()` method.
+The `get_area()` method can then access the variable `radius` encapsulated within the object, which
+is otherwise invisible to the world outside of the object.
+The method `get_area()` itself can also be accessed via the object/instance only.
+
+As we can see, the internal representation of any instance of the class `Circle` is hidden
+outside of this class (encapsulation).
+In addition, the implementation of the method `get_area()` is hidden too (abstraction).
+
+> ## Encapsulation & Abstraction
+> Encapsulation provides **information hiding**. Abstraction provides **implementation hiding**.
+{: .callout}
+
+> ## Exercise: Use Classes to Abstract out Data Loading
+> Declare a new class `CSVDataSource` that contains the `load_catchment_data` function
+> we wrote in the previous exercise as a method of this class.
+> The directory path from which to load the files should be passed to the class' constructor method.
+> Finally, construct an instance of the class `CSVDataSource` outside the statistical
+> analysis and pass it to the `analyse_data()` function.
+>> ## Hint
+>> At the end of this exercise, the code in the `analyse_data()` function should look like:
+>> ```python
+>> def analyse_data(data_source):
+>>     data = data_source.load_catchment_data()
+>>     daily_standard_deviation = compute_standard_deviation_by_day(data)
+>>     ...
+>> ```
+>> The controller code should look like:
+>> ```python
+>> data_source = compute_data.CSVDataSource(os.path.dirname(InFiles[0]))
+>> compute_data.analyse_data(data_source)
+>> ```
+> {: .solution}
+>> ## Solution
+>> For example, we can declare the class `CSVDataSource` like this:
+>>
+>> ```python
+>> class CSVDataSource:
+>>     """
+>>     Loads all the catchment CSV files within a specified directory.
+>>     """
+>>     def __init__(self, dir_path):
+>>         self.dir_path = dir_path
+>>
+>>     def load_catchment_data(self):
+>>         data_file_paths = glob.glob(os.path.join(self.dir_path, 'rain_data_2015*.csv'))
+>>         if len(data_file_paths) == 0:
+>>             raise ValueError('No CSV files found in the data directory')
+>>         data = map(models.read_variable_from_csv, data_file_paths)
+>>         return list(data)
+>> ```
+>> In the controller, we create an instance of CSVDataSource and pass it
+>> into the statistical analysis function.
+>>
+>> ```python
+>> data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+>> analyse_data(data_source)
+>> ```
+>> The `analyse_data()` function is modified to receive any data source object (that implements
+>> the `load_catchment_data()` method) as a parameter.
+>> ```python
+>> def analyse_data(data_source):
+>>     data = data_source.load_catchment_data()
+>>     daily_standard_deviation = compute_standard_deviation_by_day(data)
+>>     ...
+>> ```
+>> We have now fully decoupled the reading of the data from the statistical analysis and
+>> the analysis is not fixed to reading from a directory of CSV files. Indeed, we can pass various
+>> data sources to this function now, as long as they implement the `load_catchment_data()`
+>> method.
+>> +>> While the overall behaviour of the code and its results are unchanged, +>> the way we invoke data analysis has changed. +>> We must update our regression test to match this, to ensure we have not broken anything: +>> ```python +>> ... +>> def test_compute_data(): +>> from catchment.compute_data import analyse_data, CSVDataSource +>> path = Path.cwd() / "../data" +>> data_source = CSVDataSource(path) +>> result = analyse_data(data_source) +>> expected_output = [ [0. , 0.18801829], +>> ... +>> ``` +> {: .solution} +{: .challenge} + + +## Interfaces + +An interface is another important concept in software design related to abstraction and +encapsulation. For a software component, it declares the operations that can be invoked on +that component, along with input arguments and what it returns. By knowing these details, +we can communicate with this component without the need to know how it implements this interface. + +API (Application Programming Interface) is one example of an interface that allows separate +systems (external to one another) to communicate with each other. +For example, a request to Google Maps service API may get +you the latitude and longitude for a given address. +Twitter API may return all tweets that contain +a given keyword that have been posted within a certain date range. + +Internal interfaces within software dictate how +different parts of the system interact with each other. +Even when these are not explicitly documented or thought out, they still exist. + +For example, our `Circle` class implicitly has an interface - you can call `get_area()` method +on it and it will return a number representing its surface area. + +> ## Exercise: Identify an Interface Between `CSVDataSource` and `analyse_data` +> What is the interface between CSVDataSource class and `analyse_data()` function. +> Think about what functions `analyse_data()` needs to be able to call to perform its duty, +> what parameters they need and what they return. +>> ## Solution +>> The interface is the `load_catchment_data()` method, which takes no parameters and +>> returns a list where each entry is a 2D array of catchment measurement data (read from some +>> data source). +>> +>> Any object passed into `analyse_data()` should conform to this interface. +> {: .solution} +{: .challenge} + + +## Polymorphism + +In general, polymorphism is the idea of having multiple implementations/forms/shapes +of the same abstract concept. +It is the provision of a single interface to entities of different types, +or the use of a single symbol to represent multiple different types. + +There are [different versions of polymorphism](https://www.bmc.com/blogs/polymorphism-programming/). +For example, method or operator overloading is one +type of polymorphism enabling methods and operators to take parameters of different types. + +We will have a look at the interface-based polymorphism. +In OOP, it is possible to have different object classes that conform to the same interface. +For example, let's have a look at the following class representing a `Rectangle`: + +```python +class Rectangle: + def __init__(self, width, height): + self.width = width + self.height = height + def get_area(self): + return self.width * self.height +``` + +Like `Circle`, this class provides the `get_area()` method. +The method takes the same number of parameters (none), and returns a number. +However, the implementation is different. This is one type of *polymorphism*. 
+
+The word "polymorphism" means "many forms", and in programming it refers to
+methods/functions/operators with the same name that can be executed on many objects or classes.
+
+Using our `Circle` and `Rectangle` classes, we can create a list of different shapes and iterate
+through the list to find their total surface area as follows:
+
+```python
+my_circle = Circle(radius=10)
+my_rectangle = Rectangle(width=5, height=3)
+my_shapes = [my_circle, my_rectangle]
+total_area = sum(shape.get_area() for shape in my_shapes)
+```
+
+Note that we have not created a common superclass or linked the classes `Circle` and `Rectangle`
+together in any way. This is possible due to polymorphism.
+You could also say that, when we are calculating the total surface area,
+the method for calculating the area of each shape is abstracted away to the relevant class.
+
+How can polymorphism be useful in our software project?
+For example, we can replace our `CSVDataSource` with another class that reads a totally
+different file format (e.g. JSON instead of CSV), or reads from an external service or database.
+All of these changes can now be made without changing the analysis function as we have decoupled
+the process of data loading from the data analysis earlier.
+Conversely, if we wanted to write a new analysis function, we could support any of these
+data sources with no extra work.
+
+> ## Exercise: Add an Additional DataSource
+> Create another class that supports loading catchment data from JSON files, with the
+> appropriate `load_catchment_data()` method.
+> There is a function in `models.py` that loads from JSON in the following format:
+> ```json
+> [
+>   {
+>     "Site": "FP35",
+>     "Site Name": "Lower Wraxall Farm",
+>     "Date": "01/12/2008 23:00",
+>     "Rainfall (mm)": 0.0
+>   },
+>   {
+>     "Site": "FP35",
+>     "Site Name": "Lower Wraxall Farm",
+>     "Date": "01/12/2008 23:15",
+>     "Rainfall (mm)": 0.0
+>   }
+> ]
+> ```
+> Finally, at run time, construct an appropriate instance based on the file extension.
+>> ## Solution
+>> The new class could look something like:
+>> ```python
+>> class JSONDataSource:
+>>     """
+>>     Loads all the catchment data from JSON files within a specified directory.
+>>     """
+>>     def __init__(self, dir_path):
+>>         self.dir_path = dir_path
+>>
+>>     def load_catchment_data(self):
+>>         data_file_paths = glob.glob(os.path.join(self.dir_path, 'rain_data_2015*.json'))
+>>         if len(data_file_paths) == 0:
+>>             raise ValueError('No JSON files found in the data directory')
+>>         data = map(models.load_json, data_file_paths)
+>>         return list(data)
+>> ```
+>> Additionally, the controller will need to select the appropriate DataSource to
+>> provide to the analysis:
+>> ```python
+>> _, extension = os.path.splitext(InFiles[0])
+>> if extension == '.json':
+>>     data_source = JSONDataSource(os.path.dirname(InFiles[0]))
+>> elif extension == '.csv':
+>>     data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+>> else:
+>>     raise ValueError(f'Unsupported file format: {extension}')
+>> analyse_data(data_source)
+>> ```
+>> As you can see, all the above changes have been made without modifying
+>> the analysis code itself.
+> {: .solution}
+{: .challenge}
+
+## Testing Using Mock Objects
+
+We can use this abstraction to also make testing more straightforward.
+Instead of having our tests use real file system data, we can provide
+a mock or dummy implementation in place of one of the real classes.
+Providing that what we use as a substitute conforms to the same interface, +the code we are testing should work just the same. +Such mock/dummy implementation could just returns some fixed example data. + +An convenient way to do this in Python is using Python's [mock object library](https://docs.python.org/3/library/unittest.mock.html). +This is a whole topic in itself - +but a basic mock can be constructed using a couple of lines of code: + +```python +from unittest.mock import Mock + +mock_version = Mock() +mock_version.method_to_mock.return_value = 42 +``` + +Here we construct a mock in the same way you would construct a class. +Then we specify a method that we want to behave a specific way. + +Now whenever you call `mock_version.method_to_mock()` the return value will be `42`. + + +> ## Exercise: Test Using a Mock Implementation +> Complete this test for `analyse_data()`, using a mock object in place of the +> `data_source`: +> ```python +> from unittest.mock import Mock +> +> def test_compute_data_mock_source(): +> from catchment.compute_data import analyse_data +> data_source = Mock() +> +> # TODO: configure data_source mock +> +> result = analyse_data(data_source) +> +> # TODO: add assert on the contents of result +> ``` +> Create a mock that returns some fixed data and to use as the `data_source` in order to test +> the `analyse_data` method. +> Use this mock in a test. +> +> Do not forget to import `Mock` from the `unittest.mock` package. +>> ## Solution +>> ```python +>> from unittest.mock import Mock +>> +>> def test_compute_data_mock_source(): +>> from catchment.compute_data import analyse_data +>> data_source = Mock() +>> +>> data_source.load_catchment_data.return_value = [pd.DataFrame( +>> data=[[1.0, 1.0], +>> [2.0, 1.0], +>> [4.0, 2.0]], +>> index=[pd.to_datetime('2000-01-01 01:00'), +>> pd.to_datetime('2000-01-01 02:00'), +>> pd.to_datetime('2000-01-01 03:00')], +>> columns=['A', 'B'] +>> )] +>> +>> result = analyse_data(data_source) +>> npt.assert_array_almost_equal(result, [[1.527525, 0.57735 ]]) +>> ``` +> {: .solution} +{: .challenge} + +## Programming Paradigms + +Until now, we have mainly been writing procedural code. +In the previous episode, we mentioned [pure functions](/33-code-refactoring/index.html#pure-functions) +and Functional Programming. +In this episode, we have touched a bit upon classes, encapsulation and polymorphism, +which are characteristics of (but not limited to) the Object Oriented Programming (OOP). +All these different programming paradigms provide varied approaches to structuring your code - +each with certain strengths and weaknesses when used to solve particular types of problems. +In many cases, particularly with modern languages, a single language can allow many different +structural approaches and mixing programming paradigms within your code. +Once your software begins to get more complex - it is common to use aspects of [different paradigm](/programming-paradigms/index.html) +to handle different subtasks. +Because of this, it is useful to know about the [major paradigms](/programming-paradigms/index.html), +so you can recognise where it might be useful to switch. +This is outside of scope of this course - we have some extra episodes on the topics of +[Procedural Programming](/programming-paradigms/index.html#procedural-programming), +[Functional Programming](/functional-programming/index.html) and +[Object Oriented Programming](/object-oriented-programming/index.html) if you want to know more. + +> ## So Which One is Python? 
+> Python is a multi-paradigm and multi-purpose programming language.
+> You can use it as a procedural language and you can use it in a more object oriented way.
+> It does tend to land more on the object oriented side as all its core data types
+> (strings, integers, floats, booleans, lists,
+> sets, arrays, tuples, dictionaries, files)
+> as well as functions, modules and classes are objects.
+>
+> Since functions in Python are also objects that can be passed around like any other object,
+> Python is also well suited to functional programming.
+> One of the most popular Python libraries for data manipulation,
+> [Pandas](https://pandas.pydata.org/) (built on top of NumPy),
+> supports a functional programming style
+> as most of its functions on data are not changing the data (no side effects)
+> but producing new data to reflect the result of the function.
+{: .callout}
diff --git a/_extras/refactor-4-architecture-revisited.md b/_extras/refactor-4-architecture-revisited.md
new file mode 100644
index 000000000..660ddda11
--- /dev/null
+++ b/_extras/refactor-4-architecture-revisited.md
@@ -0,0 +1,570 @@
+---
+title: "Refactor 4: Architecture Revisited: Extending Software"
+teaching: 15
+exercises: 0
+questions:
+- "How can we extend our software within the constraints of the MVC architecture?"
+objectives:
+- "Extend our software to add a view of a single measurement site and extend the software's command line interface to request a specific view."
+keypoints:
+- "By breaking down our software into components with a single responsibility, we avoid having to rewrite it all when requirements change.
+  Such components can be as small as a single function, or be a software package in their own right."
+---
+
+As we have seen, we have different programming paradigms that are suitable for different problems
+and affect the structure of our code.
+In programming languages that support multiple paradigms, such as Python,
+we have the luxury of using elements of different paradigms and we,
+as software designers and programmers,
+can decide how to use those elements in different architectural components of our software.
+Let's now circle back to the architecture of our software for one final look.
+
+## MVC Revisited
+
+We've been developing our software using the **Model-View-Controller** (MVC) architecture so far,
+but, as we have seen, MVC is just one of the common architectural patterns
+and is not the only choice we could have made.
+
+### Separation of Responsibilities
+
+Separation of responsibilities is important when designing software architectures
+in order to reduce the code's complexity and increase its maintainability.
+Note, however, that there are limits to everything -
+and MVC architecture is no exception.
+The Controller often transcends into the Model and View
+and a clear separation is sometimes difficult to maintain.
+For example, the Command Line Interface provides both the View
+(what the user sees and how they interact with the command line)
+and the Controller (invoking of a command) aspects of a CLI application.
+In Web applications, the Controller often manipulates the data (received from the Model)
+before displaying it to the user or passing it from the user to the Model.
+ +There are many variants of an MVC-like pattern (such as +[Model-View-Presenter](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93presenter) (MVP), +[Model-View-Viewmodel](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93viewmodel) (MVVM), etc.), +but in most cases, the distinction between these patterns isn't particularly important. +What really matters is that we are making decisions about the architecture of our software +that suit the way in which we expect to use it. +We should reuse these established ideas where we can, but we don't need to stick to them exactly. + +The key thing to take away is the distinction between the Model and the View code, while +the View and the Controller can be more or less coupled together (e.g. the code that specifies +there is a button on the screen, might be the same code that specifies what that button does). +The View may be hard to test, or use special libraries to draw the UI, but should not contain any +complex logic, and is really just a presentation layer on top of the Model. +The Model, conversely, should not care how the data is displayed. +For example, the View may present dates as "Monday 24th July 2023", +but the Model stores it using a `Date` object rather than its string representation. + +## Our Project's Architecture (Revisited) + +Recall that in our software project, the **Controller** module is in `catchment-analysis.py`, +and the View and Model modules are contained in +`catchment/views.py` and `catchment/models.py`, respectively. +Data underlying the Model is contained within the directory `data`. + +Looking at the code in the branch `full-data-analysis` (where we should be currently located), +we can notice that the new code was added in a separate script `catchment/compute_data.py` and +contains a mix of Model, View and Controller code. + +> ## Exercise: Identify Model, View and Controller Parts of the Code +> Looking at the code inside `compute_data.py`, what parts could be considered +> Model, View and Controller code? +> +>> ## Solution +>> * Computing the standard deviation belongs to Model. +>> * Reading the data from CSV files also belongs to Model. +>> * Displaying of the output as a graph is View. +>> * The logic that processes the supplied files is Controller. +> {: .solution} +{: .challenge} + +Within the Model further separations make sense. +For example, as we did in the before, separating out the impure code that interacts with +the file system from the pure calculations helps with readability and testability. +Nevertheless, the MVC architectural pattern is a great starting point when thinking about +how you should structure your code. + +> ## Exercise: Split out the Model, View and Controller Code +> Refactor `analyse_data()` function so that the Model, View and Controller code +> we identified in the previous exercise is moved to appropriate modules. +>> ## Solution +>> The idea here is for the `analyse_data()` function not to have any "view" considerations. +>> That is, it should just compute and return the data and +>> should be located in `catchment/models.py`. +>> +>> ```python +>> def analyse_data(data_source): +>> """Calculate the standard deviation by day between datasets +>> Gets all the measurement data from the CSV files in the data directory, +>> works out the mean for each day, and then graphs the standard deviation +>> of these means. 
+>>     """
+>>     data = data_source.load_catchment_data()
+>>     daily_standard_deviation = compute_standard_deviation_by_data(data)
+>>
+>>     return daily_standard_deviation
+>> ```
+>> There can be a separate bit of code in the Controller `catchment-analysis.py`
+>> that chooses how the data should be presented, e.g. as a graph:
+>>
+>> ```python
+>> if args.full_data_analysis:
+>>     _, extension = os.path.splitext(InFiles[0])
+>>     if extension == '.json':
+>>         data_source = JSONDataSource(os.path.dirname(InFiles[0]))
+>>     elif extension == '.csv':
+>>         data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+>>     else:
+>>         raise ValueError(f'Unsupported file format: {extension}')
+>>     data_result = analyse_data(data_source)
+>>     graph_data = {
+>>         'daily standard deviation': data_result,
+>>     }
+>>     views.visualize(graph_data)
+>>     return
+>> ```
+>> Note that this is, more or less, the change we made when writing our regression test.
+>> This demonstrates that splitting up Model code from View code can
+>> immediately make your code much more testable.
+>> Ensure you re-run our regression test to check this refactoring has not
+>> changed the output of `analyse_data()`.
+> {: .solution}
+{: .challenge}
+
+At this point, you have refactored and tested all the code on branch `full-data-analysis`
+and it is working as expected. The branch is ready to be incorporated into `develop`
+and then, later on, `main`, which may also have been changed by other developers working on
+the code at the same time, so make sure to update accordingly and resolve any conflicts.
+
+~~~
+$ git switch develop
+$ git merge full-data-analysis
+~~~
+{: .language-bash}
+
+Let's now have a closer look at our Controller, and at how command line arguments are handled
+in Python (something you may find yourself doing often if your code needs to be run as a
+command line tool).
+
+
+### Controller file structure
+
+You will have noticed already that the structure of the `catchment-analysis.py` file
+follows this pattern:
+
+~~~
+# import modules
+
+def main():
+    # perform some actions
+
+if __name__ == "__main__":
+    # perform some actions before main()
+    main()
+~~~
+{: .language-python}
+
+In this pattern the actions performed by the script are contained within the `main` function
+(which does not need to be called `main`,
+but using this convention helps others in understanding your code).
+The `main` function is then called within the `if __name__ == "__main__":` statement,
+after some other actions have been performed
+(usually the parsing of command-line arguments, which will be explained below).
+`__name__` is a special dunder variable which is set,
+along with a number of other special dunder variables,
+by the Python interpreter before the execution of any code in the source file.
+The value the interpreter gives to `__name__` is determined by
+the manner in which the source file is loaded.
+
+If we run the source file directly using the Python interpreter, e.g.:
+
+~~~
+python catchment-analysis.py
+~~~
+{: .language-bash}
+then the interpreter will assign the hard-coded string `"__main__"` to the `__name__` variable:
+
+~~~
+__name__ = "__main__"
+...
+# rest of your code
+~~~
+{: .language-python}
+
+However, if your source file is loaded as a module by another Python script
+(note that, because the file name contains a hyphen, a plain `import` statement will not work here,
+so we have to use the standard `importlib` module instead), e.g.:
+
+~~~
+import importlib
+catchment_analysis = importlib.import_module("catchment-analysis")
+~~~
+{: .language-python}
+
+then the interpreter will assign the name of the imported module, `"catchment-analysis"`,
+to the `__name__` variable:
+
+~~~
+__name__ = "catchment-analysis"
+...
+# rest of your code
+~~~
+{: .language-python}
+
+Because of this behaviour of the interpreter,
+we can put any code that should only be executed when running the script
+directly within the `if __name__ == "__main__":` structure,
+allowing the rest of the code within the script to be
+safely imported by another script if we so wish.
+
+While it may not seem very useful to have your controller script importable by another script,
+there are a number of situations in which you would want to do this:
+
+- for testing your code, you can have your testing framework import the main script,
+  and run special test functions which then call the `main` function directly;
+- where you want to not only be able to run your script from the command line,
+  but also provide a programmer-friendly application programming interface (API) for advanced users.
+
+### Passing Command-line Options to Controller
+
+The standard Python library for reading command line arguments passed to a script is
+[`argparse`](https://docs.python.org/3/library/argparse.html).
+This module reads arguments passed by the system,
+and enables the automatic generation of help and usage messages.
+These include, as we saw at the start of this course,
+helpful error messages when users give the program invalid arguments.
+
+The basic usage of `argparse` can be seen in the `catchment-analysis.py` script.
+First we import the library:
+
+~~~
+import argparse
+~~~
+{: .language-python}
+
+We then initialise the argument parser class, passing an (optional) description of the program:
+
+~~~
+parser = argparse.ArgumentParser(
+    description='A basic environmental data management system')
+~~~
+{: .language-python}
+
+Once the parser has been initialised, we can add
+the arguments that we want `argparse` to look out for.
+In our basic case, we want only the names of the file(s) to process:
+
+~~~
+parser.add_argument(
+    'infiles',
+    nargs='+',
+    help='Input CSV(s) containing measurement data')
+~~~
+{: .language-python}
+
+Here we have defined what the argument will be called (`'infiles'`) when it is read in;
+the number of arguments to be expected
+(`nargs='+'`, where `'+'` indicates that there should be 1 or more arguments passed);
+and a help string for the user
+(`help='Input CSV(s) containing measurement data'`).
+
+You can add as many arguments as you wish,
+and these can be either mandatory (like the one above) or optional.
+Most of the complexity in using `argparse` is in adding the correct argument options,
+and we will explain how to do this in more detail below.
+
+Finally, we parse the arguments passed to the script using:
+
+~~~
+args = parser.parse_args()
+~~~
+{: .language-python}
+
+This returns an object (that we've called `args`) containing all the arguments requested.
+These can be accessed using the names that we have defined for each argument,
+e.g. `args.infiles` would return the filenames that have been input.
+
+The help for the script can be accessed using the `-h` or `--help` optional argument
+(which `argparse` includes by default):
+
+~~~
+python catchment-analysis.py --help
+~~~
+{: .language-bash}
+~~~
+usage: catchment-analysis.py [-h] infiles [infiles ...]
+
+A basic environmental data management system
+
+positional arguments:
+  infiles     Input CSV(s) containing measurement data
+
+optional arguments:
+  -h, --help  show this help message and exit
+~~~
+{: .output}
+
+The help page starts with the command line usage,
+illustrating what inputs can be given (any within `[]` brackets are optional).
+It then lists the **positional** and **optional** arguments,
+giving as detailed a description of each as you have added to the `add_argument()` command.
+Positional arguments are arguments that need to be included
+in the proper position or order when calling the script.
+
+Note that optional arguments are indicated by `-` or `--`, followed by the argument name.
+Positional arguments are simply inferred by their position.
+It is possible to have multiple positional arguments,
+but usually this is only practical where all (or all but one) of the positional arguments
+contain a clearly defined number of elements.
+If more than one option can have an indeterminate number of entries,
+then it is better to create them as 'optional' arguments.
+These can still be made required inputs, though,
+by setting `required = True` within the `add_argument()` command.
+
+> ## Positional and Optional Argument Order
+>
+> The usage section of the help page above shows
+> the optional arguments going before the positional arguments.
+> This is the customary way to present options, but is not mandatory.
+> Instead, there are two rules which must be followed for these arguments:
+>
+> 1. Positional and optional arguments must each be given all together, and not intermixed.
+>    For example, the order can be either `optional - positional` or `positional - optional`,
+>    but not `optional - positional - optional`.
+> 2. Positional arguments must be given in the order that they are shown
+>    in the usage section of the help page.
+{: .callout}
+
+Now that you have some familiarity with `argparse`,
+we will demonstrate below how you can use it to add extra functionality to your controller.
+
+### Choosing the Measurement Data Series
+
+Up until now we have only read the rainfall data from our `data/rain_data_2015-12.csv` file.
+But what if we want to read the river measurement data too?
+We can simply change the file that we are reading by passing a different file name.
+But when we do this with the river data we get the following error:
+~~~
+python catchment-analysis.py data/river_data_2015-12.csv
+~~~
+{: .language-bash}
+~~~
+Traceback (most recent call last):
+  File "/Users/mbessdl2/work/manchester/Course_Material/Intermediate_Programming_Skills/python-intermediate-rivercatchment-template/catchment-analysis.py", line 39, in <module>
+    main(args)
+  File "/Users/mbessdl2/work/manchester/Course_Material/Intermediate_Programming_Skills/python-intermediate-rivercatchment-template/catchment-analysis.py", line 22, in main
+    measurement_data = models.read_variable_from_csv(filename)
+  File "/Users/mbessdl2/work/manchester/Course_Material/Intermediate_Programming_Skills/python-intermediate-rivercatchment-template/catchment/models.py", line 22, in read_variable_from_csv
+    dataset = pd.read_csv(filename, usecols=['Date', 'Site', 'Rainfall (mm)'])
+...
+ValueError: Usecols do not match columns, columns expected but not found: ['Rainfall (mm)']
+~~~
+{: .output}
+
+This error message tells us that the pandas `read_csv` function
+has failed to find one of the columns it was asked to read.
+We would not expect a column called `'Rainfall (mm)'` in the river data file,
+so we need to make the `read_variable_from_csv` function more flexible,
+so that it can read any named measurement dataset.
+
+The first step is to add an argument to our command line interface,
+so that users can specify the measurement dataset.
+This can be done by adding the following argument to your `catchment-analysis.py` script:
+~~~
+    parser.add_argument(
+        '-m', '--measurements',
+        help = 'Name of measurement data series to load',
+        required = True)
+~~~
+{: .language-python}
+Here we have defined the full name of the argument (`--measurements`),
+as well as a short name (`-m`) for convenience.
+Note that the short name is preceded by a single dash (`-`),
+while the full name is preceded by two dashes (`--`).
+We provide a `help` string for the user,
+and finally we set `required = True`,
+so that the end user must define which data series they want to read.
+
+Once this is added, your help message should look like this:
+~~~
+python catchment-analysis.py --help
+~~~
+{: .language-bash}
+~~~
+usage: catchment-analysis.py [-h] -m MEASUREMENTS infiles [infiles ...]
+
+A basic environmental data management system
+
+positional arguments:
+  infiles               Input CSV(s) containing measurement data
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -m MEASUREMENTS, --measurements MEASUREMENTS
+                        Name of measurement data series to load
+~~~
+{: .output}
+
+> ## Optional vs Required Arguments, and Argument Groups
+> You will note that the `--measurements` argument is still listed as an optional argument.
+> This is because the two basic argument groups in `argparse` are
+> positional and optional.
+> In the usage section the `--measurements` option is listed without `[]` brackets,
+> indicating that it is a required argument,
+> but this is still not very clear for end users.
+>
+> To make the help clearer we can add an extra argument group,
+> and assign `--measurements` to this:
+> ~~~
+> ...
+> req_group = parser.add_argument_group('required arguments')
+> ...
+> req_group.add_argument(
+>     '-m', '--measurements',
+>     help = 'Name of measurement data series to load',
+>     required = True)
+> ...
+> ~~~
+> {: .language-python}
+> This will return the following help message:
+> ~~~
+> python catchment-analysis.py --help
+> ~~~
+> {: .language-bash}
+> ~~~
+> usage: catchment-analysis.py [-h] -m MEASUREMENTS infiles [infiles ...]
+>
+> A basic environmental data management system
+>
+> positional arguments:
+>   infiles               Input CSV(s) containing measurement data
+>
+> optional arguments:
+>   -h, --help            show this help message and exit
+>
+> required arguments:
+>   -m MEASUREMENTS, --measurements MEASUREMENTS
+>                         Name of measurement data series to load
+> ~~~
+> {: .output}
+> This solution is not perfect, because the positional arguments are also required,
+> but it will at least help end users distinguish between optional and required flagged arguments.
+{: .callout}
+
+> ## Default Argument Number and Type
+> `argparse` will, by default, assume that each argument added will take a single value,
+> and that it will be a string (`type = str`). If you want to change this for any argument you
+> should explicitly set `type` and `nargs`.
+>
+> Note also that the returned object will be a single item unless `nargs` has been set,
+> in which case a list of items is returned (even if `nargs = 1` is used).
+{: .callout}
+
+
+#### Controller and Model Adaptation
+
+The new measurement string needs to be passed to the `read_variable_from_csv` function,
+and applied appropriately within that function.
+First we add a `measurement` argument to the `read_variable_from_csv` function in `catchment/models.py`
+(remembering to update the function docstring at the same time):
+~~~
+# catchment/models.py
+...
+def read_variable_from_csv(filename, measurement):
+    """Reads a named variable from a CSV file, and returns a
+    pandas dataframe containing that variable. The CSV file must contain
+    a column of dates, a column of site IDs, and (one or more) columns
+    of data - only one of which will be read.
+
+    :param filename: Filename of CSV to load
+    :param measurement: Name of data column to be read
+    :return: 2D array of given variable. Index will be dates,
+             Columns will be the individual sites
+    """
+...
+~~~
+{: .language-python}
+Following this, we need to change two lines of code:
+the first being the CSV reading code,
+and the second being the code which reorganises the dataset before it is returned:
+~~~
+# catchment/models.py
+...
+def read_variable_from_csv(filename, measurement):
+...
+    dataset = pd.read_csv(filename, usecols=['Date', 'Site', measurement])
+...
+    for site in dataset['Site'].unique():
+        newdataset[site] = dataset[dataset['Site'] == site].set_index('Date')[measurement]
+...
+~~~
+{: .language-python}
+
+
+Finally, within the `main` function of the controller we should pass `args.measurements` as an argument:
+~~~
+# catchment-analysis.py
+...
+def main(args):
+...
+    for filename in in_files:
+        measurement_data = models.read_variable_from_csv(filename, args.measurements)
+...
+~~~
+{: .language-python}
+
+You can now test your new code to ensure it works as expected:
+~~~
+python catchment-analysis.py -m 'Rainfall (mm)' data/rain_data_2015-12.csv
+~~~
+{: .language-bash}
+![Rainfall daily metrics](../fig/rainfall_daily_metrics.png){: .image-with-shadow width="800px" }
+
+~~~
+python catchment-analysis.py -m 'pH continuous' data/river_data_2015-12.csv
+~~~
+{: .language-bash}
+![River pH daily metrics](../fig/pH_daily_metrics.png){: .image-with-shadow width="800px" }
+
+Note that we have to use quotation marks around any strings which contain spaces or special
+characters, so that the shell passes them to the script as a single argument.
+
+
+
+> ## Additional Material
+>
+> Now that we've covered the basics of different programming paradigms
+> and how we can integrate them into our multi-layer architecture,
+> there are two optional extra episodes which you may find interesting.
+>
+> Both episodes cover the persistence layer of software architectures
+> and methods of persistently storing data, but take different approaches.
+> The episode on [persistence with JSON](../persistence) covers
+> some more advanced concepts in Object Oriented Programming, while
+> the episode on [databases](../databases) starts to build towards a true multi-layer architecture,
+> which would allow our software to handle much larger quantities of data.
+{: .callout}
+
+
+## Towards Collaborative Software Development
+
+Having looked at some theoretical aspects of software design,
+we are now circling back to implementing our software design
+and developing our software to satisfy the requirements collaboratively in a team.
+At an intermediate level of software development,
+there is a wealth of practices that could be used,
+and applying suitable design and coding practices is what separates
+an intermediate developer from someone who has just started coding.
+The key for an intermediate developer is to balance these concerns
+appropriately for each software project,
+and to apply just enough design and development practice that steady progress can be made.
+
+One practice that should always be considered,
+and has been shown to be very effective in team-based software development,
+is that of *code review*.
+Code reviews help to ensure that 'good' coding standards are achieved
+and maintained within a team by having multiple people
+review and comment on key code changes and consider how they fit within the codebase.
+Such reviews check the correctness of the new code, its test coverage and any functionality changes,
+and confirm that the changes follow the team's coding guidelines and best practices.
+In the following episodes we will have a look at some of the code review techniques available to us.