From 0612e82af9f0e9c57141526ca6190ff25ec2ea04 Mon Sep 17 00:00:00 2001 From: Douglas Lowe <10961945+douglowe@users.noreply.github.com> Date: Tue, 5 Mar 2024 22:19:18 +0000 Subject: [PATCH 01/12] add refactor lessons from original course PR --- _config.yml | 5 +- _extras/refactor-code-abstractions.md | 468 ++++++++++++++++++++++++++ _extras/refactor-code-refactoring.md | 296 ++++++++++++++++ _extras/refactor-software-design.md | 238 +++++++++++++ 4 files changed, 1005 insertions(+), 2 deletions(-) create mode 100644 _extras/refactor-code-abstractions.md create mode 100644 _extras/refactor-code-refactoring.md create mode 100644 _extras/refactor-software-design.md diff --git a/_config.yml b/_config.yml index 2ec03d3a8..8823b6926 100644 --- a/_config.yml +++ b/_config.yml @@ -92,10 +92,11 @@ extras_order: - figures - guide - common-issues - - discuss - protect-main-branch - vscode - - functional-programming + - refactor-software-design + - refactor-code-refactoring + - refactor-code-abstractions - persistence - databases - geopandas diff --git a/_extras/refactor-code-abstractions.md b/_extras/refactor-code-abstractions.md new file mode 100644 index 000000000..409f8312b --- /dev/null +++ b/_extras/refactor-code-abstractions.md @@ -0,0 +1,468 @@ +--- +title: "Refactor 3: Code Abstractions" +teaching: 30 +exercises: 45 +questions: +- "When is it useful to use classes to structure code?" +- "How can we make sure the components of our software are reusable?" +objectives: +- "Introduce appropriate abstractions to simplify code." +- "Understand the principles of encapsulation, polymorphism and interfaces." +- "Use mocks to replace a class in test code." +keypoints: +- "Classes and interfaces can help decouple code so it is easier to understand, test and maintain." +- "Encapsulation is bundling related data into a structured component, +along with the methods that operate on the data. It is also provides a mechanism for restricting +the access to that data, hiding the internal representation of the component." +- "Polymorphism describes the provision of a single interface to entities of different types, +or the use of a single symbol to represent different types." +--- + +## Introduction + +*Code abstraction* is the process of hiding the implementation details of a piece of +code behind an interface - i.e. the details of *how* something works are hidden away, +leaving us to deal only with *what* it does. +This allows developers to work with the code at a higher level +of abstraction, without needing to understand fully (or keep in mind) all the underlying +details and thereby reducing the cognitive load when programming. + +Abstractions can aid decoupling of code. +If one part of the code only uses another part through an appropriate abstraction +then it becomes easier for these parts to change independently. + +Let's start redesigning our code by introducing some of the abstraction techniques +to incrementally improve its design. + +You may have noticed that loading data from CSV files in a directory is "baked" into +(i.e. is part of) the `analyse_data()` function. +This is not strictly a functionality of the data analysis function, so firstly +let's decouple the data loading into a separate function. + +> ## Exercise: Decouple Data Loading from Data Analysis +> Separate out the data loading functionality from `analyse_data()` into a new function +> `load_inflammation_data()` that returns all the files to load. 
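+> A possible starting skeleton for the new function (the names below simply match the ones
+> used in the solution) is:
+> ```python
+> def load_inflammation_data(dir_path):
+>     # TODO: find all the inflammation CSV files in dir_path and
+>     # return their contents as a list of 2D inflammation data arrays
+>     ...
+> ```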
+>> ## Solution +>> The new function `load_inflammation_data()` that reads all the data into the format needed +>> for the analysis should look something like: +>> ```python +>> def load_inflammation_data(dir_path): +>> data_file_paths = glob.glob(os.path.join(dir_path, 'inflammation*.csv')) +>> if len(data_file_paths) == 0: +>> raise ValueError(f"No inflammation csv's found in path {dir_path}") +>> data = map(models.load_csv, data_file_paths) +>> return list(data) +>> ``` +>> This function can now be used in the analysis as follows: +>> ```python +>> def analyse_data(data_dir): +>> data = load_inflammation_data(data_dir) +>> daily_standard_deviation = compute_standard_deviation_by_data(data) +>> ... +>> ``` +>> The code is now easier to follow since we do not need to understand the the data loading from +>> files to read the statistical analysis, and vice versa - we do not have to understand the +>> statistical analysis when looking at data loading. +>> Ensure you re-run the regression tests to check this refactoring has not +>> changed the output of `analyse_data()`. +> {: .solution} +{: .challenge} + +However, even with this change, the data loading is still coupled with the data analysis. +For example, if we have to support loading data from different sources +(e.g. JSON files and CSV files), we would have to pass some kind of a flag indicating +what we want into `analyse_data()`. Instead, we would like to decouple the +consideration of what data to load from the `analyse_data()` function entirely. +One way we can do this is by using *encapsulation* and *classes*. + +## Encapsulation & Classes + +*Encapsulation* is the packing of "data" and "functions operating on that data" into a +single component/object. +It is also provides a mechanism for restricting the access to that data. +Encapsulation means that the internal representation of a component is generally hidden +from view outside of the component's definition. + +Encapsulation allows developers to present a consistent interface to an object/component +that is independent of its internal implementation. +For example, encapsulation can be used to hide the values or +state of a structured data object inside a **class**, preventing direct access to them +that could violate the object's state maintained by the class' methods. +Note that object-oriented programming (OOP) languages support encapsulation, +but encapsulation is not unique to OOP. + +So, a class is a way of grouping together data with some methods that manipulate that data. +In Python, you can *declare* a class as follows: + +```python +class Circle: + pass +``` + +Classes are typically named using "CapitalisedWords" naming convention - e.g. FileReader, +OutputStream, Rectangle. + +You can *construct* an *instance* of a class elsewhere in the code by doing the following: + +```python +my_circle = Circle() +``` + +When you construct a class in this ways, the class' *constructor* method is called. +It is also possible to pass values to the constructor in order to configure the class instance: + +```python +class Circle: + def __init__(self, radius): + self.radius = radius + +my_circle = Circle(10) +``` + +The constructor has the special name `__init__`. +Note it has a special first parameter called `self` by convention - it is +used to access the current *instance* of the object being created. + +A class can be thought of as a cookie cutter template, and instances as the cookies themselves. +That is, one class can have many instances. 
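+For example, continuing the `Circle` class above, a minimal sketch of creating several
+independent instances from the same class is:
+
+```python
+small_circle = Circle(1)
+large_circle = Circle(100)
+
+# each instance keeps its own copy of the data given to its constructor
+print(small_circle.radius)  # 1
+print(large_circle.radius)  # 100
+```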
+ +Classes can also have other methods defined on them. +Like constructors, they have the special parameter `self` that must come first. + +```python +import math + +class Circle: + ... + def get_area(self): + return math.pi * self.radius * self.radius +... +print(my_circle.get_area()) +``` + +On the last line of the code above, the instance of the class, `my_circle`, will be automatically +passed as the first parameter (`self`) when calling the `get_area()` method. +The `get_area()` method can then access the variable `radius` encapsulated within the object, which +is otherwise invisible to the world outside of the object. +The method `get_area()` itself can also be accessed via the object/instance only. + +As we can see, internal representation of any instance of class `Circle` is hidden +outside of this class (encapsulation). +In addition, implementation of the method `get_area()` is hidden too (abstraction). + +> ## Encapsulation & Abstraction +> Encapsulation provides **information hiding**. Abstraction provides **implementation hiding**. +{: .callout} + +> ## Exercise: Use Classes to Abstract out Data Loading +> Declare a new class `CSVDataSource` that contains the `load_inflammation_data` function +> we wrote in the previous exercise as a method of this class. +> The directory path where to load the files from should be passed in the class' constructor method. +> Finally, construct an instance of the class `CSVDataSource` outside the statistical +> analysis and pass it to `analyse_data()` function. +>> ## Hint +>> At the end of this exercise, the code in the `analyse_data()` function should look like: +>> ```python +>> def analyse_data(data_source): +>> data = data_source.load_inflammation_data() +>> daily_standard_deviation = compute_standard_deviation_by_data(data) +>> ... +>> ``` +>> The controller code should look like: +>> ```python +>> data_source = CSVDataSource(os.path.dirname(InFiles[0])) +>> analyse_data(data_source) +>> ``` +> {: .solution} +>> ## Solution +>> For example, we can declare class `CSVDataSource` like this: +>> +>> ```python +>> class CSVDataSource: +>> """ +>> Loads all the inflammation CSV files within a specified directory. +>> """ +>> def __init__(self, dir_path): +>> self.dir_path = dir_path +>> +>> def load_inflammation_data(self): +>> data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.csv')) +>> if len(data_file_paths) == 0: +>> raise ValueError(f"No inflammation CSV files found in path {self.dir_path}") +>> data = map(models.load_csv, data_file_paths) +>> return list(data) +>> ``` +>> In the controller, we create an instance of CSVDataSource and pass it +>> into the the statistical analysis function. +>> +>> ```python +>> data_source = CSVDataSource(os.path.dirname(InFiles[0])) +>> analyse_data(data_source) +>> ``` +>> The `analyse_data()` function is modified to receive any data source object (that implements +>> the `load_inflammation_data()` method) as a parameter. +>> ```python +>> def analyse_data(data_source): +>> data = data_source.load_inflammation_data() +>> daily_standard_deviation = compute_standard_deviation_by_data(data) +>> ... +>> ``` +>> We have now fully decoupled the reading of the data from the statistical analysis and +>> the analysis is not fixed to reading from a directory of CSV files. Indeed, we can pass various +>> data sources to this function now, as long as they implement the `load_inflammation_data()` +>> method. 
+>> +>> While the overall behaviour of the code and its results are unchanged, +>> the way we invoke data analysis has changed. +>> We must update our regression test to match this, to ensure we have not broken anything: +>> ```python +>> ... +>> def test_compute_data(): +>> from inflammation.compute_data import analyse_data +>> path = Path.cwd() / "../data" +>> data_source = CSVDataSource(path) +>> result = analyse_data(data_source) +>> expected_output = [0.,0.22510286,0.18157299,0.1264423,0.9495481,0.27118211 +>> ... +>> ``` +> {: .solution} +{: .challenge} + + +## Interfaces + +An interface is another important concept in software design related to abstraction and +encapsulation. For a software component, it declares the operations that can be invoked on +that component, along with input arguments and what it returns. By knowing these details, +we can communicate with this component without the need to know how it implements this interface. + +API (Application Programming Interface) is one example of an interface that allows separate +systems (external to one another) to communicate with each other. +For example, a request to Google Maps service API may get +you the latitude and longitude for a given address. +Twitter API may return all tweets that contain +a given keyword that have been posted within a certain date range. + +Internal interfaces within software dictate how +different parts of the system interact with each other. +Even when these are not explicitly documented or thought out, they still exist. + +For example, our `Circle` class implicitly has an interface - you can call `get_area()` method +on it and it will return a number representing its surface area. + +> ## Exercise: Identify an Interface Between `CSVDataSource` and `analyse_data` +> What is the interface between CSVDataSource class and `analyse_data()` function. +> Think about what functions `analyse_data()` needs to be able to call to perform its duty, +> what parameters they need and what they return. +>> ## Solution +>> The interface is the `load_inflammation_data()` method, which takes no parameters and +>> returns a list where each entry is a 2D array of patient inflammation data (read from some +> data source). +>> +>> Any object passed into `analyse_data()` should conform to this interface. +> {: .solution} +{: .challenge} + + +## Polymorphism + +In general, polymorphism is the idea of having multiple implementations/forms/shapes +of the same abstract concept. +It is the provision of a single interface to entities of different types, +or the use of a single symbol to represent multiple different types. + +There are [different versions of polymorphism](https://www.bmc.com/blogs/polymorphism-programming/). +For example, method or operator overloading is one +type of polymorphism enabling methods and operators to take parameters of different types. + +We will have a look at the interface-based polymorphism. +In OOP, it is possible to have different object classes that conform to the same interface. +For example, let's have a look at the following class representing a `Rectangle`: + +```python +class Rectangle: + def __init__(self, width, height): + self.width = width + self.height = height + def get_area(self): + return self.width * self.height +``` + +Like `Circle`, this class provides the `get_area()` method. +The method takes the same number of parameters (none), and returns a number. +However, the implementation is different. This is one type of *polymorphism*. 
+ +The word "polymorphism" means "many forms", and in programming it refers to +methods/functions/operators with the same name that can be executed on many objects or classes. + +Using our `Circle` and `Rectangle` classes, we can create a list of different shapes and iterate +through the list to find their total surface area as follows: + +```python +my_circle = Circle(radius=10) +my_rectangle = Rectangle(width=5, height=3) +my_shapes = [my_circle, my_rectangle] +total_area = sum(shape.get_area() for shape in my_shapes) +``` + +Note that we have not created a common superclass or linked the classes `Circle` and `Rectangle` +together in any way. It is possible due to polymorphism. +You could also say that, when we are calculating the total surface area, +the method for calculating the area of each shape is abstracted away to the relevant class. + +How can polymorphism be useful in our software project? +For example, we can replace our `CSVDataSource` with another class that reads a totally +different file format (e.g. JSON instead of CSV), or reads from an external service or database +All of these changes can be now be made without changing the analysis function as we have decoupled +the process of data loading from the data analysis earlier. +Conversely, if we wanted to write a new analysis function, we could support any of these +data sources with no extra work. + +> ## Exercise: Add an Additional DataSource +> Create another class that supports loading patient data from JSON files, with the +> appropriate `load_inflammation_data()` method. +> There is a function in `models.py` that loads from JSON in the following format: +> ```json +> [ +> { +> "observations": [0, 1] +> }, +> { +> "observations": [0, 2] +> } +> ] +> ``` +> Finally, at run time construct an appropriate instance based on the file extension. +>> ## Solution +>> The new class could look something like: +>> ```python +>> class JSONDataSource: +>> """ +>> Loads patient data with inflammation values from JSON files within a specified folder. +>> """ +>> def __init__(self, dir_path): +>> self.dir_path = dir_path +>> +>> def load_inflammation_data(self): +>> data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.json')) +>> if len(data_file_paths) == 0: +>> raise ValueError(f"No inflammation JSON's found in path {self.dir_path}") +>> data = map(models.load_json, data_file_paths) +>> return list(data) +>> ``` +>> Additionally, in the controller will need to select the appropriate DataSource to +>> provide to the analysis: +>>```python +>> _, extension = os.path.splitext(InFiles[0]) +>> if extension == '.json': +>> data_source = JSONDataSource(os.path.dirname(InFiles[0])) +>> elif extension == '.csv': +>> data_source = CSVDataSource(os.path.dirname(InFiles[0])) +>> else: +>> raise ValueError(f'Unsupported file format: {extension}') +>> analyse_data(data_source) +>>``` +>> As you can seen, all the above changes have been made made without modifying +>> the analysis code itself. +> {: .solution} +{: .challenge} + +## Testing Using Mock Objects + +We can use this abstraction to also make testing more straight forward. +Instead of having our tests use real file system data, we can instead provide +a mock or dummy implementation instead of one of the real classes. +Providing that what we use as a substitute conforms to the same interface, +the code we are testing should work just the same. +Such mock/dummy implementation could just returns some fixed example data. 
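+For example, a minimal hand-written stand-in for `CSVDataSource` (a sketch - the class name
+and the fixed data are purely illustrative) only needs to provide the same
+`load_inflammation_data()` method:
+
+```python
+class StubDataSource:
+    """Returns a small fixed dataset instead of reading files from disk."""
+    def load_inflammation_data(self):
+        return [[[0, 2, 0]], [[0, 1, 0]]]
+
+# analyse_data() accepts any object conforming to the data source interface
+result = analyse_data(StubDataSource())
+```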
+ +An convenient way to do this in Python is using Python's [mock object library](https://docs.python.org/3/library/unittest.mock.html). +This is a whole topic in itself - +but a basic mock can be constructed using a couple of lines of code: + +```python +from unittest.mock import Mock + +mock_version = Mock() +mock_version.method_to_mock.return_value = 42 +``` + +Here we construct a mock in the same way you would construct a class. +Then we specify a method that we want to behave a specific way. + +Now whenever you call `mock_version.method_to_mock()` the return value will be `42`. + + +> ## Exercise: Test Using a Mock Implementation +> Complete this test for `analyse_data()`, using a mock object in place of the +> `data_source`: +> ```python +> from unittest.mock import Mock +> +> def test_compute_data_mock_source(): +> from inflammation.compute_data import analyse_data +> data_source = Mock() +> +> # TODO: configure data_source mock +> +> result = analyse_data(data_source) +> +> # TODO: add assert on the contents of result +> ``` +> Create a mock that returns some fixed data and to use as the `data_source` in order to test +> the `analyse_data` method. +> Use this mock in a test. +> +> Do not forget to import `Mock` from the `unittest.mock` package. +>> ## Solution +>> ```python +>> from unittest.mock import Mock +>> +>> def test_compute_data_mock_source(): +>> from inflammation.compute_data import analyse_data +>> data_source = Mock() +>> data_source.load_inflammation_data.return_value = [[[0, 2, 0]], +>> [[0, 1, 0]]] +>> +>> result = analyse_data(data_source) +>> npt.assert_array_almost_equal(result, [0, math.sqrt(0.25) ,0]) +>> ``` +> {: .solution} +{: .challenge} + +## Programming Paradigms + +Until now, we have mainly been writing procedural code. +In the previous episode, we mentioned [pure functions](/33-code-refactoring/index.html#pure-functions) +and Functional Programming. +In this episode, we have touched a bit upon classes, encapsulation and polymorphism, +which are characteristics of (but not limited to) the Object Oriented Programming (OOP). +All these different programming paradigms provide varied approaches to structuring your code - +each with certain strengths and weaknesses when used to solve particular types of problems. +In many cases, particularly with modern languages, a single language can allow many different +structural approaches and mixing programming paradigms within your code. +Once your software begins to get more complex - it is common to use aspects of [different paradigm](/programming-paradigms/index.html) +to handle different subtasks. +Because of this, it is useful to know about the [major paradigms](/programming-paradigms/index.html), +so you can recognise where it might be useful to switch. +This is outside of scope of this course - we have some extra episodes on the topics of +[Procedural Programming](/programming-paradigms/index.html#procedural-programming), +[Functional Programming](/functional-programming/index.html) and +[Object Oriented Programming](/object-oriented-programming/index.html) if you want to know more. + +> ## So Which One is Python? +> Python is a multi-paradigm and multi-purpose programming language. +> You can use it as a procedural language and you can use it in a more object oriented way. +> It does tend to land more on the object oriented side as all its core data types +> (strings, integers, floats, booleans, lists, +> sets, arrays, tuples, dictionaries, files) +> as well as functions, modules and classes are objects. 
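+> You can check this for yourself in a Python console (a small illustrative sketch):
+>
+> ```python
+> isinstance(42, object)      # True
+> isinstance("text", object)  # True
+> isinstance(len, object)     # True - even built-in functions are objects
+> ```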
+> +> Since functions in Python are also objects that can be passed around like any other object, +> Python is also well suited to functional programming. +> One of the most popular Python libraries for data manipulation, +> [Pandas](https://pandas.pydata.org/) (built on top of NumPy), +> supports a functional programming style +> as most of its functions on data are not changing the data (no side effects) +> but producing a new data to reflect the result of the function. +{: .callout} diff --git a/_extras/refactor-code-refactoring.md b/_extras/refactor-code-refactoring.md new file mode 100644 index 000000000..ca3a1c451 --- /dev/null +++ b/_extras/refactor-code-refactoring.md @@ -0,0 +1,296 @@ +--- +title: "Refactor 2: Code Refactoring" +teaching: 30 +exercises: 20 +questions: +- "How do you refactor code without breaking it?" +- "What is decoupled code?" +- "What are benefits of using pure functions in our code?" +objectives: +- "Understand the benefits of code decoupling." +- "Understand the use of regressions tests to avoid breaking existing code when refactoring." +- "Understand the use of pure functions in software design to make the code easier to test." +- "Refactor a piece of code to separate out 'pure' from 'impure' code." +keypoints: +- "Implementing regression tests before refactoring gives you confidence that your changes have not +broken the code." +- "Decoupling code into pure functions that process data without side effects makes code easier +to read, test and maintain." +--- + +## Introduction + +*Code refactoring* is the process of improving the design of an existing code - for example +to make it more decoupled. +Recall that *code decoupling* means breaking the system into smaller components and reducing the +interdependence between these components, so that they can be tested and maintained independently. +Two components of code can be considered **decoupled** if a change in one does not +necessitate a change in the other. +While two connected units cannot always be totally decoupled, **loose coupling** +is something we should aim for. Benefits of decoupled code include: + +* easier to read as you do not need to understand the + details of the other component. +* easier to test, as one of the components can be replaced + by a test or a mock version of it. +* code tends to be easier to maintain, as changes can be isolated + from other parts of the code. + +When faced with an existing piece of code that needs modifying a good refactoring +process to follow is: + +1. Make sure you have tests that verify the current behaviour +2. Refactor the code +3. Verify that that the behaviour of the code is identical to that before refactoring. + +In this episode we will refactor the function `analyse_data()` in `compute_data.py` +from our project in the following two ways: +* add more tests so we can be more confident that future changes will have the +intended effect and will not break the existing code. +* split the monolithic `analyse_data()` function into a number of smaller and mode decoupled functions +making the code easier to understand and test. + +## Writing Tests Before Refactoring + +When refactoring, first we need to make sure there are tests that verity +the code behaviour as it is now (or write them if they are missing), +then refactor the code and, finally, check that the original tests still pass. +This is to make sure we do not break the existing behaviour through refactoring. 
+ +There is a bit of a "chicken and egg" problem here - if the refactoring is supposed to make it easier +to write tests in the future, how can we write tests before doing the refactoring? +The tricks to get around this trap are: + + * Test at a higher level, with coarser accuracy + * Write tests that you intend to remove + +The best tests are ones that test single bits of functionality rigorously. +However, with our current `analyse_data()` code that is not possible because it is a +large function doing a little bit of everything. +Instead we will make minimal changes to the code to make it a bit more testable. + +Firstly, +we will modify the function to return the data instead of visualising it because graphs are harder +to test automatically (i.e. they need to be viewed and inspected manually in order to determine +their correctness). +Next, we will make the assert statements verify what the outcome is +currently, rather than checking whether that is correct or not. +Such tests are meant to +verify that the behaviour does not *change* rather than checking the current behaviour is correct +(there should be another set of tests checking the correctness). +This kind of testing is called **regression testing** as we are testing for +regressions in existing behaviour. + +Refactoring code is not meant to change its behaviour, but sometimes to make it possible to verify +you not changing the important behaviour you have to make small tweaks to the code to write +the tests at all. + +> ## Exercise: Write Regression Tests +> Modify the `analyse_data()` function not to plot a graph and return the data instead. +> Then, add a new test file called `test_compute_data.py` in the `tests` folder and +> add a regression test to verify the current output of `analyse_data()`. We will use this test +> in the remainder of this section to verify the output `analyse_data()` is unchanged each time +> we refactor or change code in the future. +> +> Start from the skeleton test code below: +> +> ```python +> def test_analyse_data(): +> from inflammation.compute_data import analyse_data +> path = Path.cwd() / "../data" +> result = analyse_data(path) +> +> # TODO: add an assert for the value of result +> ``` +> Use `assert_array_almost_equal` from the `numpy.testing` library to +> compare arrays of floating point numbers. +> +>> ## Hint +>> When determining the correct return data result to use in tests, it may be helpful to assert the +>> result equals some random made-up data, observe the test fail initially and then +>> copy and paste the correct result into the test. 
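+>> For example, a deliberately wrong assert such as the sketch below will fail, and the
+>> pytest output will then show you the actual contents of `result`:
+>> ```python
+>> npt.assert_array_almost_equal(result, [1.0, 2.0, 3.0])
+>> ```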
+> {: .solution} +> +>> ## Solution +>> One approach we can take is to: +>> * comment out the visualize method on `analyse_data()` +>> (as this will cause our test to hang waiting for the result data) +>> * return the data instead, so we can write asserts on the data +>> * See what the calculated value is, and assert that it is the same as the expected value +>> +>> Putting this together, your test may look like: +>> +>> ```python +>> import numpy.testing as npt +>> from pathlib import Path +>> +>> def test_analyse_data(): +>> from inflammation.compute_data import analyse_data +>> path = Path.cwd() / "../data" +>> result = analyse_data(path) +>> expected_output = [0.,0.22510286,0.18157299,0.1264423,0.9495481,0.27118211, +>> 0.25104719,0.22330897,0.89680503,0.21573875,1.24235548,0.63042094, +>> 1.57511696,2.18850242,0.3729574,0.69395538,2.52365162,0.3179312, +>> 1.22850657,1.63149639,2.45861227,1.55556052,2.8214853,0.92117578, +>> 0.76176979,2.18346188,0.55368435,1.78441632,0.26549221,1.43938417, +>> 0.78959769,0.64913879,1.16078544,0.42417995,0.36019114,0.80801707, +>> 0.50323031,0.47574665,0.45197398,0.22070227] +>> npt.assert_array_almost_equal(result, expected_output) +>> ``` +>> +>> Note that while the above test will detect if we accidentally break the analysis code and +>> change the output of the analysis, is not a good or complete test for the following reasons: +>> * It is not at all obvious why the `expected_output` is correct +>> * It does not test edge cases +>> * If the data files in the directory change - the test will fail +>> +>> We would need additional tests to check the above. +> {: .solution} +{: .challenge} + +## Separating Pure and Impure Code + +Now that we have our regression test for `analyse_data()` in place, we are ready to refactor the +function further. +We would like to separate out as much of its code as possible as *pure functions*. +Pure functions are very useful and much easier to test as they take input only from its input +parameters and output only via their return values. + +### Pure Functions + +A pure function in programming works like a mathematical function - +it takes in some input and produces an output and that output is +always the same for the same input. +That is, the output of a pure function does not depend on any information +which is not present in the input (such as global variables). +Furthermore, pure functions do not cause any *side effects* - they do not modify the input data +or data that exist outside the function (such as printing text, writing to a file or +changing a global variable). They perform actions that affect nothing but the value they return. + +### Benefits of Pure Functions + +Pure functions are easier to understand because they eliminate side effects. +The reader only needs to concern themselves with the input +parameters of the function and the function code itself, rather than +the overall context the function is operating in. +Similarly, a function that calls a pure function is also easier +to understand - we only need to understand what the function returns, which will probably +be clear from the context in which the function is called. +Finally, pure functions are easier to reuse as the caller +only needs to understand what parameters to provide, rather +than anything else that might need to be configured prior to the call. +For these reasons, you should try and have as much of the complex, analytical and mathematical +code are pure functions. + + +Some parts of a program are inevitably impure. 
+Programs need to read input from users, generate a graph, or write results to a file or a database. +Well designed programs separate complex logic from the necessary impure "glue" code that +interacts with users and other systems. +This way, you have easy-to-read and easy-to-test pure code that contains the complex logic +and simplified impure code that reads data from a file or gathers user input. Impure code may +be harder to test but, when simplified like this, may only require a handful of tests anyway. + +> ## Exercise: Refactoring To Use a Pure Function +> Refactor the `analyse_data()` function to delegate the data analysis to a new +> pure function `compute_standard_deviation_by_day()` and separate it +> from the impure code that handles the input and output. +> The pure function should take in the data, and return the analysis result, as follows: +> ```python +> def compute_standard_deviation_by_day(data): +> # TODO +> return daily_standard_deviation +> ``` +>> ## Solution +>> The analysis code will be refactored into a separate function that may look something like: +>> ```python +>> def compute_standard_deviation_by_day(data): +>> means_by_day = map(models.daily_mean, data) +>> means_by_day_matrix = np.stack(list(means_by_day)) +>> +>> daily_standard_deviation = np.std(means_by_day_matrix, axis=0) +>> return daily_standard_deviation +>> ``` +>> The `analyse_data()` function now calls the `compute_standard_deviation_by_day()` function, +>> while keeping all the logic for reading the data, processing it and showing it in a graph: +>>```python +>>def analyse_data(data_dir): +>> """Calculate the standard deviation by day between datasets +>> Gets all the inflammation csvs within a directory, works out the mean +>> inflammation value for each day across all datasets, then graphs the +>> standard deviation of these means.""" +>> data_file_paths = glob.glob(os.path.join(data_dir, 'inflammation*.csv')) +>> if len(data_file_paths) == 0: +>> raise ValueError(f"No inflammation csv's found in path {data_dir}") +>> data = map(models.load_csv, data_file_paths) +>> daily_standard_deviation = compute_standard_deviation_by_day(data) +>> +>> graph_data = { +>> 'standard deviation by day': daily_standard_deviation, +>> } +>> # views.visualize(graph_data) +>> return daily_standard_deviation +>>``` +>> Make sure to re-run the regression test to check this refactoring has not +>> changed the output of `analyse_data()`. +> {: .solution} +{: .challenge} + +### Testing Pure Functions + +Now we have our analysis implemented as a pure function, we can write tests that cover +all the things we would like to check without depending on CSVs files. +This is another advantage of pure functions - they are very well suited to automated testing, +i.e. their tests are: +* **easier to write** - we construct input and assert the output +without having to think about making sure the global state is correct before or after +* **easier to read** - the reader will not have to open a CSV file to understand why +the test is correct +* **easier to maintain** - if at some point the data format changes +from CSV to JSON, the bulk of the tests need not be updated + +> ## Exercise: Testing a Pure Function +> Add tests for `compute_standard_deviation_by_data()` that check for situations +> when there is only one file with multiple rows, +> multiple files with one row, and any other cases you can think of that should be tested. 
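+> Remember that the `data` argument is a list with one entry per file, where each entry is a
+> 2D array of patient inflammation data. A single (hypothetical) starting test case might
+> therefore look like:
+> ```python
+> def test_single_file_two_patients():
+>     from inflammation.compute_data import compute_standard_deviation_by_data
+>     # one file containing two patients
+>     result = compute_standard_deviation_by_data([[[0, 1, 0], [0, 2, 0]]])
+>     npt.assert_array_almost_equal(result, [0, 0, 0])
+> ```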
+>> ## Solution +>> You might have thought of more tests, but we can easily extend the test by parametrizing +>> with more inputs and expected outputs: +>> ```python +>>@pytest.mark.parametrize('data,expected_output', [ +>> ([[[0, 1, 0], [0, 2, 0]]], [0, 0, 0]), +>> ([[[0, 2, 0]], [[0, 1, 0]]], [0, math.sqrt(0.25), 0]), +>> ([[[0, 1, 0], [0, 2, 0]], [[0, 1, 0], [0, 2, 0]]], [0, 0, 0]) +>>], +>>ids=['Two patients in same file', 'Two patients in different files', 'Two identical patients in two different files']) +>>def test_compute_standard_deviation_by_day(data, expected_output): +>> from inflammation.compute_data import compute_standard_deviation_by_data +>> +>> result = compute_standard_deviation_by_data(data) +>> npt.assert_array_almost_equal(result, expected_output) +``` +> {: .solution} +{: .challenge} + +> ## Functional Programming +> **Functional programming** is a programming paradigm where programs are constructed by +> applying and composing/chaining pure functions. +> Some programming languages, such as Haskell or Lisp, support writing pure functional code only. +> Other languages, such as Python, Java, C++, allow mixing **functional** and **procedural** +> programming paradigms. +> Read more in the [extra episode on functional programming](/functional-programming/index.html) +> and when it can be very useful to switch to this paradigm +> (e.g. to employ MapReduce approach for data processing). +{: .callout} + + +There are no definite rules in software design but making your complex logic out of +composed pure functions is a great place to start when trying to make your code readable, +testable and maintainable. This is particularly useful for: + +* Data processing and analysis +(for example, using [Python Pandas library](https://pandas.pydata.org/) for data manipulation where most of functions appear pure) +* Doing simulations +* Translating data from one format to another + +{% include links.md %} diff --git a/_extras/refactor-software-design.md b/_extras/refactor-software-design.md new file mode 100644 index 000000000..31b45e411 --- /dev/null +++ b/_extras/refactor-software-design.md @@ -0,0 +1,238 @@ +--- +title: "Refactor 1: Software Design" +teaching: 25 +exercises: 20 +questions: +- "Why should we invest time in software design?" +- "What should we consider when designing software?" +objectives: +- "Understand the goals and principles of designing 'good' software." +- "Understand code decoupling and code abstraction design techniques." +- "Understand what code refactoring is." +keypoints: +- "'Good' code is designed to be maintainable: readable by people who did not author the code, +testable through a set of automated tests, adaptable to new requirements." +- "The sooner you adopt a practice of designing your software in the lifecycle of your project, +the easier the development and maintenance process will." +--- + +## Introduction + +Ideally, we should have at least a rough design of our software sketched out +before we write a single line of code. +This design should be based around the requirements and the structure of the problem we are trying +to solve: what are the concepts we need to represent in our code +and what are the relationships between them. +And importantly, who will be using our software and how will they interact with it. + +As a piece of software grows, +it will reach a point where there is too much code for us to keep in mind at once. 
+At this point, it becomes particularly important to think of the overall design and +structure of our software, how should all the pieces of functionality fit together, +and how should we work towards fulfilling this overall design throughout development. +Even if you did not think about the design of your software from the very beginning - +it is not too late to start now. + +It is not easy to come up with a complete definition for the term **software design**, +but some of the common aspects are: + +- **Algorithm design** - + what method are we going to use to solve the core research/business problem? +- **Software architecture** - + what components will the software have and how will they cooperate? +- **System architecture** - + what other things will this software have to interact with and how will it do this? +- **UI/UX** (User Interface / User Experience) - + how will users interact with the software? + +There is literature on each of the above software design aspects - we will not go into details of +them all here. +Instead, we will learn some techniques to structure our code better to satisfy some of the +requirements of 'good' software and revisit +our software's [MVC architecture](/11-software-project/index.html#software-architecture) +in the context of software design. + +## Good Software Design Goals +Aspirationally, what makes good code can be summarised in the following quote from the +[Intent HG blog](https://intenthq.com/blog/it-audience/what-is-good-code-a-scientific-definition/): + +> *“Good code is written so that is readable, understandable, +> covered by automated tests, not over complicated +> and does well what is intended to do.”* + +Software has become a crucial aspect of reproducible research, as well as an asset that +can be reused or repurposed. +Thus, it is even more important to take time to design the software to be easily *modifiable* and +*extensible*, to save ourselves and our team a lot of time later on when we have +to fix a problem or the software's requirements change. + +Satisfying the above properties will lead to an overall software design +goal of having *maintainable* code, which is: + +* *readable* (and understandable) by developers who did not write the code, e.g. by: + * following a consistent coding style and naming conventions + * using meaningful and descriptive names for variables, functions, and classes + * documenting code to describe it does and how it may be used + * using simple control flow to make it easier to follow the code execution + * keeping functions and methods small and focused on a single task (also important for testing) +* *testable* through a set of (preferably automated) tests, e.g. by: + * writing unit, functional, regression tests to verify the code produces + the expected outputs from controlled inputs and exhibits the expected behavior over time + as the code changes +* *adaptable* (easily modifiable and extensible) to satisfy new requirements, e.g. by: + * writing low-coupled/decoupled code where each part of the code has a separate concern and + the lowest possible dependency on other parts of the code making it + easier to test, update or replace - e.g. by separating the "business logic" and "presentation" + layers of the code on the architecture level (recall the [MVC architecture](/11-software-project/index.html#software-architecture)), + or separating "pure" (without side-effects) and "impure" (with side-effects) parts of the code on the + level of functions. 
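+As a small, made-up illustration of the *readable* points above, compare:
+
+~~~
+def dm(d):
+    return d.mean(axis=0)
+
+def daily_mean(data):
+    """Calculate the mean value for each day across the whole dataset."""
+    return data.mean(axis=0)
+~~~
+{: .language-python}
+
+Both functions do the same thing, but the second one tells the reader what it does and why.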
+ +Now that we know what goals we should aspire to, let us take a critical look at the code in our +software project and try to identify ways in which it can be improved. + +Our software project contains a branch `full-data-analysis` with code for a new feature of our +inflammation analysis software. Recall that you can see all your branches as follows: +~~~ +$ git branch --all +~~~ +{: .language-bash} + +Let's checkout a new local branch from the `full-data-analysis` branch, making sure we +have saved and committed all current changes before doing so. + +~~~ +git checkout -b full-data-analysis +~~~ +{: .language-bash} + +This new feature enables user to pass a new command-line parameter `--full-data-analysis` causing +the software to find the directory containing the first input data file (provided via command line +parameter `infiles`) and invoke the data analysis over all the data files in that directory. +This bit of functionality is handled by `inflammation-analysis.py` in the project root. + +The new data analysis code is located in `compute_data.py` file within the `inflammation` directory +in a function called `analyse_data()`. +This function loads all the data files for a given a directory path, then +calculates and compares standard deviation across all the data by day and finaly plots a graph. + +> ## Exercise: Identifying How Code Can be Improved? +> Critically examine the code in `analyse_data()` function in `compute_data.py` file. +> +> In what ways does this code not live up to the ideal properties of 'good' code? +> Think about ways in which you find it hard to understand. +> Think about the kinds of changes you might want to make to it, and what would +> make making those changes challenging. +>> ## Solution +>> You may have found others, but here are some of the things that make the code +>> hard to read, test and maintain. +>> +>> * **Hard to read:** everything is implemented in a single function. +>> In order to understand it, you need to understand how file loading works at the same time as +>> the analysis itself. +>> * **Hard to modify:** if you wanted to use the data for some other purpose and not just +>> plotting the graph you would have to change the `data_analysis()` function. +>> * **Hard to modify or test:** it is always analysing a fixed set of CSV data files +>> stored on a disk. +>> * **Hard to modify:** it does not have any tests so we cannot be 100% confident the code does +>> what it claims to do; any changes to the code may break something and it would be harder and +>> more time-consuming to figure out what. +>> +>> Make sure to keep the list you have created in the exercise above. +>> For the remainder of this section, we will work on improving this code. +>> At the end, we will revisit your list to check that you have learnt ways to address each of the +>> problems you had found. +>> +>> There may be other things to improve with the code on this branch, e.g. how command line +>> parameters are being handled in `inflammation-analysis.py`, but we are focussing on +>> `analyse_data()` function for the time being. +> {: .solution} +{: .challenge} + +## Poor Design Choices & Technical Debt + +When faced with a problem that you need to solve by writing code - it may be tempted to +skip the design phase and dive straight into coding. +What happens if you do not follow the good software design and development best practices? 
+It can lead to accumulated 'technical debt', +which (according to [Wikipedia](https://en.wikipedia.org/wiki/Technical_debt)), +is the "cost of additional rework caused by choosing an easy (limited) solution now +instead of using a better approach that would take longer". +The pressure to achieve project goals can sometimes lead to quick and easy solutions, +which make the software become +more messy, more complex, and more difficult to understand and maintain. +The extra effort required to make changes in the future is the interest paid on the (technical) debt. +It is natural for software to accrue some technical debt, +but it is important to pay off that debt during a maintenance phase - +simplifying, clarifying the code, making it easier to understand - +to keep these interest payments on making changes manageable. + +There is only so much time available in a project. +How much effort should we spend on designing our code properly +and using good development practices? +The following [XKCD comic](https://xkcd.com/844/) summarises this tension: + +![Writing good code comic](../fig/xkcd-good-code-comic.png){: .image-with-shadow width="400px" } + +At an intermediate level there are a wealth of practices that *could* be used, +and applying suitable design and coding practices is what separates +an *intermediate developer* from someone who has just started coding. +The key for an intermediate developer is to balance these concerns +for each software project appropriately, +and employ design and development practices *enough* so that progress can be made. +It is very easy to under-design software, +but remember it is also possible to over-design software too. + +## Techniques for Improving Code + +How code is structured is important for helping people who are developing and maintaining it +to understand and update it. +By breaking down our software into components with a single responsibility, +we avoid having to rewrite it all when requirements change. +Such components can be as small as a single function, or be a software package in their own right. +These smaller components can be understood individually without having to understand +the entire codebase at once. + +### Code Refactoring + +*Code refactoring* is the process of improving the design of an existing code - +changing the internal structure of code without changing its +external behavior, with the goal of making the code more readable, maintainable, efficient or easier +to test. +This can include things such as renaming variables, reorganising +functions to avoid code duplication and increase reuse, and simplifying conditional statements. + +### Code Decoupling + +*Code decoupling* is a code design technique that involves breaking a (complex) +software system into smaller, more manageable parts, and reducing the interdependence +between these different parts of the system. +This means that a change in one part of the code usually does not require a change in the other, +thereby making its development more efficient and less error prone. + +### Code Abstraction + +*Abstraction* is the process of hiding the implementation details of a piece of +code (typically behind an interface) - i.e. the details of *how* something works are hidden away, +leaving code developers to deal only with *what* it does. +This allows developers to work with the code at a higher level +of abstraction, without needing to understand fully (or keep in mind) all the underlying +details at any given time and thereby reducing the cognitive load when programming. 
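+For example (a made-up sketch - `load_measurement_data` is just an illustrative name),
+a caller that only sees
+
+~~~
+data = load_measurement_data('data/')
+~~~
+{: .language-python}
+
+does not need to know whether the files are found with `glob`, how each CSV is parsed, or how
+the individual files are combined - only that it gets the loaded data back.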
+ +Abstraction can be achieved through techniques such as *encapsulation*, *inheritance*, and +*polymorphism*, which we will explore in the next episodes. There are other [abstraction techniques](https://en.wikipedia.org/wiki/Abstraction_(computer_science)) +available too. + +## Improving Our Software Design + +Refactoring our code to make it more decoupled and to introduce abstractions to +hide all but the relevant information about parts of the code is important for creating more +maintainable code. +It will help to keep our codebase clean, modular and easier to understand. + +Writing good code is hard and takes practise. +You may also be faced with an existing piece of code that breaks some (or all) of the +good code principles, and your job will be to improve/refactor it so that it can evolve further. +We will now look into some examples of the techniques that can help us redesign our code +and incrementally improve its quality. + +{% include links.md %} From b7765c59bf473f4d2b909772feb4a4b4d118d87f Mon Sep 17 00:00:00 2001 From: Douglas Lowe <10961945+douglowe@users.noreply.github.com> Date: Tue, 5 Mar 2024 22:31:07 +0000 Subject: [PATCH 02/12] changed project names in software design episode --- _extras/refactor-software-design.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_extras/refactor-software-design.md b/_extras/refactor-software-design.md index 31b45e411..bc7b96f9c 100644 --- a/_extras/refactor-software-design.md +++ b/_extras/refactor-software-design.md @@ -108,9 +108,9 @@ git checkout -b full-data-analysis This new feature enables user to pass a new command-line parameter `--full-data-analysis` causing the software to find the directory containing the first input data file (provided via command line parameter `infiles`) and invoke the data analysis over all the data files in that directory. -This bit of functionality is handled by `inflammation-analysis.py` in the project root. +This bit of functionality is handled by `catchment-analysis.py` in the project root. -The new data analysis code is located in `compute_data.py` file within the `inflammation` directory +The new data analysis code is located in `compute_data.py` file within the `catchment` directory in a function called `analyse_data()`. This function loads all the data files for a given a directory path, then calculates and compares standard deviation across all the data by day and finaly plots a graph. @@ -143,7 +143,7 @@ calculates and compares standard deviation across all the data by day and finaly >> problems you had found. >> >> There may be other things to improve with the code on this branch, e.g. how command line ->> parameters are being handled in `inflammation-analysis.py`, but we are focussing on +>> parameters are being handled in `catchment-analysis.py`, but we are focussing on >> `analyse_data()` function for the time being. > {: .solution} {: .challenge} From 300525986075e24067b00bcd46cf2ab4181cb8f7 Mon Sep 17 00:00:00 2001 From: Douglas Lowe <10961945+douglowe@users.noreply.github.com> Date: Tue, 5 Mar 2024 22:55:07 +0000 Subject: [PATCH 03/12] regression testing example adapted --- _extras/refactor-code-refactoring.md | 31 ++++++++++++++++++---------- 1 file changed, 20 insertions(+), 11 deletions(-) diff --git a/_extras/refactor-code-refactoring.md b/_extras/refactor-code-refactoring.md index ca3a1c451..39d0de80b 100644 --- a/_extras/refactor-code-refactoring.md +++ b/_extras/refactor-code-refactoring.md @@ -96,8 +96,8 @@ the tests at all. 
> > ```python > def test_analyse_data(): -> from inflammation.compute_data import analyse_data -> path = Path.cwd() / "../data" +> from catchment.compute_data import analyse_data +> path = Path.cwd() / "data" > result = analyse_data(path) > > # TODO: add an assert for the value of result @@ -105,10 +105,17 @@ the tests at all. > Use `assert_array_almost_equal` from the `numpy.testing` library to > compare arrays of floating point numbers. > +> Remember to run the test using `python -m pytest` from the project base directory: +> ```bash +> python -m pytest tests/test_analyse_data.py +> ``` +> >> ## Hint >> When determining the correct return data result to use in tests, it may be helpful to assert the >> result equals some random made-up data, observe the test fail initially and then >> copy and paste the correct result into the test. +>> +>> Remember also that NaN values can be defined using the numpy library (`numpy.nan`). > {: .solution} > >> ## Solution @@ -121,20 +128,22 @@ the tests at all. >> Putting this together, your test may look like: >> >> ```python +>> import numpy as np >> import numpy.testing as npt >> from pathlib import Path >> >> def test_analyse_data(): ->> from inflammation.compute_data import analyse_data ->> path = Path.cwd() / "../data" +>> from catchment.compute_data import analyse_data +>> path = Path.cwd() / "data" >> result = analyse_data(path) ->> expected_output = [0.,0.22510286,0.18157299,0.1264423,0.9495481,0.27118211, ->> 0.25104719,0.22330897,0.89680503,0.21573875,1.24235548,0.63042094, ->> 1.57511696,2.18850242,0.3729574,0.69395538,2.52365162,0.3179312, ->> 1.22850657,1.63149639,2.45861227,1.55556052,2.8214853,0.92117578, ->> 0.76176979,2.18346188,0.55368435,1.78441632,0.26549221,1.43938417, ->> 0.78959769,0.64913879,1.16078544,0.42417995,0.36019114,0.80801707, ->> 0.50323031,0.47574665,0.45197398,0.22070227] +>> expected_output = [[0.09133463], [0.17383042], [0.00147314], [0.00147314], +>> [0. ], [0.00294628], [0.03682848], [0.00883883], +>> [0.00147314], [0.169411 ], [0.00147314], [0. ], +>> [0.00147314], [0. ], [0. ], [0. ], +>> [0. ], [0.00294628], [0.00147314], [0.00147314], +>> [0.00147314], [0.00147314], [0. 
], [ np.nan], +>> [0.00147314], [0.00147314], [0.00147314], [0.00147314], +>> [0.01473139], [0.01178511], [0.02209709]] >> npt.assert_array_almost_equal(result, expected_output) >> ``` >> From 1dd65b9ba84fa9ab3c756271e1517482081a1b9f Mon Sep 17 00:00:00 2001 From: Douglas Lowe <10961945+douglowe@users.noreply.github.com> Date: Tue, 5 Mar 2024 23:05:06 +0000 Subject: [PATCH 04/12] adapted pure function section --- _extras/refactor-code-refactoring.md | 26 ++++++++++++++------------ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/_extras/refactor-code-refactoring.md b/_extras/refactor-code-refactoring.md index 39d0de80b..0db53b64a 100644 --- a/_extras/refactor-code-refactoring.md +++ b/_extras/refactor-code-refactoring.md @@ -213,25 +213,27 @@ be harder to test but, when simplified like this, may only require a handful of >> ## Solution >> The analysis code will be refactored into a separate function that may look something like: >> ```python ->> def compute_standard_deviation_by_day(data): ->> means_by_day = map(models.daily_mean, data) ->> means_by_day_matrix = np.stack(list(means_by_day)) +>>def compute_standard_deviation_by_day(data): +>> means_by_day = map(models.daily_mean, data) +>> means_by_day_matrix = pd.concat(means_by_day) >> ->> daily_standard_deviation = np.std(means_by_day_matrix, axis=0) ->> return daily_standard_deviation +>> daily_standard_deviation = pd.DataFrame(means_by_day_matrix.std(axis=1), columns=['std']) +>> return daily_standard_deviation >> ``` >> The `analyse_data()` function now calls the `compute_standard_deviation_by_day()` function, >> while keeping all the logic for reading the data, processing it and showing it in a graph: >>```python >>def analyse_data(data_dir): ->> """Calculate the standard deviation by day between datasets ->> Gets all the inflammation csvs within a directory, works out the mean ->> inflammation value for each day across all datasets, then graphs the ->> standard deviation of these means.""" ->> data_file_paths = glob.glob(os.path.join(data_dir, 'inflammation*.csv')) +>> """Calculate the standard deviation by day between datasets. +>> +>> Gets all the measurement data from the CSV files in the data directory, +>> works out the mean for each day, and then graphs the standard deviation +>> of these means. 
+>> """ +>> data_file_paths = glob.glob(os.path.join(data_dir, 'rain_data_2015*.csv')) >> if len(data_file_paths) == 0: ->> raise ValueError(f"No inflammation csv's found in path {data_dir}") ->> data = map(models.load_csv, data_file_paths) +>> raise ValueError('No CSV files found in the data directory') +>> data = map(models.read_variable_from_csv, data_file_paths) >> daily_standard_deviation = compute_standard_deviation_by_day(data) >> >> graph_data = { From 7b1f27be3cc50acd0f9654dbecb45b5f009975f1 Mon Sep 17 00:00:00 2001 From: Douglas Lowe <10961945+douglowe@users.noreply.github.com> Date: Wed, 6 Mar 2024 19:37:42 +0000 Subject: [PATCH 05/12] rename refactor episode files, and add alt arch revisited episode --- _config.yml | 7 +- ...esign.md => refactor-1-software-design.md} | 0 ...ring.md => refactor-2-code-refactoring.md} | 0 ...ons.md => refactor-3-code-abstractions.md} | 0 _extras/refactor-4-architecture-revisited.md | 698 ++++++++++++++++++ 5 files changed, 702 insertions(+), 3 deletions(-) rename _extras/{refactor-software-design.md => refactor-1-software-design.md} (100%) rename _extras/{refactor-code-refactoring.md => refactor-2-code-refactoring.md} (100%) rename _extras/{refactor-code-abstractions.md => refactor-3-code-abstractions.md} (100%) create mode 100644 _extras/refactor-4-architecture-revisited.md diff --git a/_config.yml b/_config.yml index 8823b6926..8d159fdfd 100644 --- a/_config.yml +++ b/_config.yml @@ -92,11 +92,12 @@ extras_order: - figures - guide - common-issues + - refactor-1-software-design + - refactor-2-code-refactoring + - refactor-3-code-abstractions + - refactor-4-architecture-revisited - protect-main-branch - vscode - - refactor-software-design - - refactor-code-refactoring - - refactor-code-abstractions - persistence - databases - geopandas diff --git a/_extras/refactor-software-design.md b/_extras/refactor-1-software-design.md similarity index 100% rename from _extras/refactor-software-design.md rename to _extras/refactor-1-software-design.md diff --git a/_extras/refactor-code-refactoring.md b/_extras/refactor-2-code-refactoring.md similarity index 100% rename from _extras/refactor-code-refactoring.md rename to _extras/refactor-2-code-refactoring.md diff --git a/_extras/refactor-code-abstractions.md b/_extras/refactor-3-code-abstractions.md similarity index 100% rename from _extras/refactor-code-abstractions.md rename to _extras/refactor-3-code-abstractions.md diff --git a/_extras/refactor-4-architecture-revisited.md b/_extras/refactor-4-architecture-revisited.md new file mode 100644 index 000000000..82524d23b --- /dev/null +++ b/_extras/refactor-4-architecture-revisited.md @@ -0,0 +1,698 @@ +--- +title: "Refactor 4: Architecture Revisited: Extending Software" +teaching: 15 +exercises: 0 +questions: +- "How can we extend our software within the constraints of the MVC architecture?" +objectives: +- "Extend our software to add a view of a single patient in the study and the software's command line interface to request a specific view." +keypoints: +- "By breaking down our software into components with a single responsibility, we avoid having to rewrite it all when requirements change. + Such components can be as small as a single function, or be a software package in their own right." +--- + +As we have seen, we have different programming paradigms that are suitable for different problems +and affect the structure of our code. 
+In programming languages that support multiple paradigms, such as Python,
+we have the luxury of using elements of different paradigms and we,
+as software designers and programmers,
+can decide how to use those elements in different architectural components of our software.
+Let's now circle back to the architecture of our software for one final look.
+
+## MVC Revisited
+
+We've been developing our software using the **Model-View-Controller** (MVC) architecture so far,
+but, as we have seen, MVC is just one of the common architectural patterns
+and is not the only choice we could have made.
+
+There are many variants of an MVC-like pattern (such as
+[Model-View-Presenter](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93presenter) (MVP),
+[Model-View-Viewmodel](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93viewmodel) (MVVM), etc.),
+but in most cases, the distinction between these patterns isn't particularly important.
+What really matters is that we are making decisions about the architecture of our software
+that suit the way in which we expect to use it.
+We should reuse these established ideas where we can, but we don't need to stick to them exactly.
+
+In this episode we'll be taking our Object Oriented code from the previous episode
+and integrating it into our existing MVC pattern.
+But first we will explain some features of
+the Controller (`catchment-analysis.py`) component of our architecture.
+
+### Controller file structure
+
+You will have noticed already that the structure of the `catchment-analysis.py` file
+follows this pattern:
+
+~~~
+# import modules
+
+def main():
+    # perform some actions
+
+if __name__ == "__main__":
+    # perform some actions before main()
+    main()
+~~~
+{: .language-python}
+
+In this pattern the actions performed by the script are contained within the `main` function
+(which does not need to be called `main`,
+but using this convention helps others in understanding your code).
+The `main` function is then called within the `if __name__ == "__main__":` statement,
+after some other actions have been performed
+(usually the parsing of command-line arguments, which will be explained below).
+`__name__` is a special dunder variable which is set,
+along with a number of other special dunder variables,
+by the Python interpreter before the execution of any code in the source file.
+What value is given by the interpreter to `__name__` is determined by
+the manner in which the file is loaded.
+
+If we run the source file directly using the Python interpreter, e.g.:
+
+~~~
+python catchment-analysis.py
+~~~
+{: .language-bash}
+then the interpreter will assign the hard-coded string `"__main__"` to the `__name__` variable:
+
+~~~
+__name__ = "__main__"
+...
+# rest of your code
+~~~
+{: .language-python}
+
+However, if your source file is imported by another Python script
+(note that, strictly, the file would need a hyphen-free name such as `catchment_analysis.py`
+for this import to be valid Python - the hyphenated name is kept here for illustration), e.g.:
+
+~~~
+import catchment-analysis
+~~~
+{: .language-python}
+
+then the interpreter will assign the name `"catchment-analysis"`
+from the import statement to the `__name__` variable:
+
+~~~
+__name__ = "catchment-analysis"
+...
+# rest of your code
+~~~
+{: .language-python}
+
+Because of this behaviour of the interpreter,
+we can put any code that should only be executed when running the script
+directly within the `if __name__ == "__main__":` structure,
+allowing the rest of the code within the script to be
+safely imported by another script if we so wish.
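+
+To see this behaviour in action, consider a small throwaway module - called `example.py` here
+purely for illustration (it is not part of our project) - which just reports the value of
+`__name__`:
+
+~~~
+# example.py
+print(f"__name__ is set to: {__name__}")
+
+if __name__ == "__main__":
+    print("Run directly as a script, so do the work here")
+~~~
+{: .language-python}
+
+Running the file directly, `__name__` is `"__main__"` and both lines are printed:
+
+~~~
+python example.py
+~~~
+{: .language-bash}
+~~~
+__name__ is set to: __main__
+Run directly as a script, so do the work here
+~~~
+{: .output}
+
+Importing it instead, `__name__` is the module name and the guarded block is skipped:
+
+~~~
+python -c "import example"
+~~~
+{: .language-bash}
+~~~
+__name__ is set to: example
+~~~
+{: .output}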
+
+While it may not seem very useful to have your controller script importable by another script,
+there are a number of situations in which you would want to do this:
+
+- for testing of your code, you can have your testing framework import the main script,
+  and run special test functions which then call the `main` function directly;
+- where you want to not only be able to run your script from the command-line,
+  but also provide a programmer-friendly application programming interface (API) for advanced users.
+
+### Passing Command-line Options to Controller
+
+The standard Python library for reading command line arguments passed to a script is
+[`argparse`](https://docs.python.org/3/library/argparse.html).
+This module reads arguments passed by the system,
+and enables the automatic generation of help and usage messages.
+These include, as we saw at the start of this course,
+the generation of helpful error messages when users give the program invalid arguments.
+
+The basic usage of `argparse` can be seen in the `catchment-analysis.py` script.
+First we import the library:
+
+~~~
+import argparse
+~~~
+{: .language-python}
+
+We then initialise the argument parser class, passing an (optional) description of the program:
+
+~~~
+parser = argparse.ArgumentParser(
+    description='A basic environmental data management system')
+~~~
+{: .language-python}
+
+Once the parser has been initialised we can add
+the arguments that we want argparse to look out for.
+In our basic case, we want only the names of the file(s) to process:
+
+~~~
+parser.add_argument(
+    'infiles',
+    nargs='+',
+    help='Input CSV(s) containing measurement data')
+~~~
+{: .language-python}
+
+Here we have defined what the argument will be called (`'infiles'`) when it is read in;
+the number of arguments to be expected
+(`nargs='+'`, where `'+'` indicates that there should be 1 or more arguments passed);
+and a help string for the user
+(`help='Input CSV(s) containing measurement data'`).
+
+You can add as many arguments as you wish,
+and these can be either mandatory (as the one above) or optional.
+Most of the complexity in using `argparse` is in adding the correct argument options,
+and we will explain how to do this in more detail below.
+
+Finally we parse the arguments passed to the script using:
+
+~~~
+args = parser.parse_args()
+~~~
+{: .language-python}
+
+This returns an object (that we've called `args`) containing all the arguments requested.
+These can be accessed using the names that we have defined for each argument,
+e.g. `args.infiles` would return the filenames that have been input.
+
+The help for the script can be accessed using the `-h` or `--help` optional argument
+(which `argparse` includes by default):
+
+~~~
+python catchment-analysis.py --help
+~~~
+{: .language-bash}
+~~~
+usage: catchment-analysis.py [-h] infiles [infiles ...]
+
+A basic environmental data management system
+
+positional arguments:
+  infiles     Input CSV(s) containing measurement data
+
+optional arguments:
+  -h, --help  show this help message and exit
+~~~
+{: .output}
+
+The help page starts with the command line usage,
+illustrating what inputs can be given (any within `[]` brackets are optional).
+It then lists the **positional** and **optional** arguments,
+giving as detailed a description of each as you have added to the `add_argument()` command.
+Positional arguments are arguments that need to be included
+in the proper position or order when calling the script.
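+
+Incidentally, `parse_args()` will also accept an explicit list of strings instead of reading the
+real command line, which is a handy way to experiment with (or write tests for) a parser.
+Here is a minimal sketch, reusing the `infiles` argument defined above with two of our data files:
+
+~~~
+import argparse
+
+parser = argparse.ArgumentParser(
+    description='A basic environmental data management system')
+parser.add_argument(
+    'infiles',
+    nargs='+',
+    help='Input CSV(s) containing measurement data')
+
+# passing a list of strings here bypasses the real command line (sys.argv)
+args = parser.parse_args(['data/rain_data_2015-12.csv', 'data/river_data_2015-12.csv'])
+print(args.infiles)
+~~~
+{: .language-python}
+~~~
+['data/rain_data_2015-12.csv', 'data/river_data_2015-12.csv']
+~~~
+{: .output}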
+ +Note that optional arguments are indicated by `-` or `--`, followed by the argument name. +Positional arguments are simply inferred by their position. +It is possible to have multiple positional arguments, +but usually this is only practical where all (or all but one) positional arguments +contains a clearly defined number of elements. +If more than one option can have an indeterminate number of entries, +then it is better to create them as 'optional' arguments. +These can be made a required input though, +by setting `required = True` within the `add_argument()` command. + +> ## Positional and Optional Argument Order +> +> The usage section of the help page above shows +> the optional arguments going before the positional arguments. +> This is the customary way to present options, but is not mandatory. +> Instead there are two rules which must be followed for these arguments: +> +> 1. Positional and optional arguments must each be given all together, and not inter-mixed. +> For example, the order can be either `optional - positional` or `positional - optional`, +> but not `optional - positional - optional`. +> 2. Positional arguments must be given in the order that they are shown +> in the usage section of the help page. +{: .callout} + +Now that you have some familiarity with `argparse`, +we will demonstrate below how you can use this to add extra functionality to your controller. + +### Choosing the Measurement Dataseries + +Up until now we have only read the rainfall data from our `data/rain_data_2015-12.csv` file. +But what if we want to read the river measurement data too? +We can, simply, change the file that we are reading, +by passing a different file name. +But when we do this with the river data we get the following error: +~~~ +python catchment-analysis.py data/river_data_2015-12.csv +~~~ +{: .language-bash} +~~~ +Traceback (most recent call last): + File "/Users/mbessdl2/work/manchester/Course_Material/Intermediate_Programming_Skills/python-intermediate-rivercatchment-template/catchment-analysis.py", line 39, in + main(args) + File "/Users/mbessdl2/work/manchester/Course_Material/Intermediate_Programming_Skills/python-intermediate-rivercatchment-template/catchment-analysis.py", line 22, in main + measurement_data = models.read_variable_from_csv(filename) + File "/Users/mbessdl2/work/manchester/Course_Material/Intermediate_Programming_Skills/python-intermediate-rivercatchment-template/catchment/models.py", line 22, in read_variable_from_csv + dataset = pd.read_csv(filename, usecols=['Date', 'Site', 'Rainfall (mm)']) +... +ValueError: Usecols do not match columns, columns expected but not found: ['Rainfall (mm)'] +~~~ +{: .output} + +This error message tells us that the pandas `read_csv` function +has failed to find one of the columns that are listed to be read. +We would not expect a column called `'Rainfall (mm)'` in the river data file, +so we need to make the `read_variable_from_csv` more flexible, +so that it can read any defined measurement dataset. + +The first step is to add an argument to our command line interface, +so that users can specify the measurement dataset. +This can be done by adding the following argument to your `catchment-analysis.py` script: +~~~ + parser.add_argument( + '-m', '--measurements', + help = 'Name of measurement data series to load', + required = True) +~~~ +{: .language-python} +Here we have defined the name of the argument (`--measurements`), +as well as a short name (`-m`) for lazy users to use. 
+Note that the short name is preceded by a single dash (`-`), +while the full name is preceded by two dashes (`--`). +We provide a `help` string for the user, +and finally we set `required = True`, +so that the end user must define which data series they want to read. + +Once this is added, then your help message should look like this: +~~~ +python catchment-analysis.py --help +~~~ +{: .language-bash} +~~~ +usage: catchment-analysis.py [-h] -m MEASUREMENTS infiles [infiles ...] + +A basic environmental data management system + +positional arguments: + infiles Input CSV(s) containing measurement data + +optional arguments: + -h, --help show this help message and exit + -m MEASUREMENTS, --measurements MEASUREMENTS + Name of measurement data series to use +~~~ +{: .output} + +> ## Optional vs Required Arguments, and Argument Groups +> You will note that the `--measurements` argument is still listed as an optional argument. +> This is because the two basic option groups in `argparse` are +> positional and optional. +> In the usage section the `--measurements` option is listed without `[]` brackets, +> indicating that it is an expected argument, +> but still this is not very clear for end users. +> +> To make the help clearer we can add an extra argument group, +> and assign `--measurements` to this: +> ~~~ +> ... +> req_group = parser.add_argument_group('required arguments') +> ... +> req_group.add_argument( +> '-m', '--measurements', +> help = 'Name of measurement data series to load', +> required = True) +> ... +> ~~~ +> {: .language-python} +> This will return the following help message: +> ~~~ +> python catchment-analysis.py --help +> ~~~ +> {: .language-bash} +> ~~~ +> usage: catchment-analysis.py [-h] -m MEASUREMENTS infiles [infiles ...] +> +> A basic environmental data management system +> +> positional arguments: +> infiles Input CSV(s) containing measurement data +> +> optional arguments: +> -h, --help show this help message and exit +> +> required arguments: +> -m MEASUREMENTS, --measurements MEASUREMENTS +> Name of measurement data series to use +> ~~~ +> {: .output} +> This solution is not perfect, because the positional arguments are also required, +> but it will at least help end users distinguish between optional and required flagged arguments. +{: .callout} + +> ## Default Argument Number and Type +> `argparse` will, by default, assume that each argument added will take a single value, +> and will be a string (`type = str`). If you want to change this for any argument you +> should explicitly set `type` and `nargs`. +> +> Note also, that the returned object will be a single item unless `nargs` has been set, +> in which case a list of items is returned (even if `nargs = 1` is used). +{: .callout} + + +#### Controller and Model Adaption + +The new measurement string needs to be passed to the `read_variable_from_csv` function, +and applied appropriately within that function. +First we add a `measurements` argument to the `read_variable_from_csv` function in `catchment/models.py` +(remembering to update the function docstring at the same time): +~~~ +# catchment/models.py +... +def read_variable_from_csv(filename, measurement): + """Reads a named variable from a CSV file, and returns a + pandas dataframe containing that variable. The CSV file must contain + a column of dates, a column of site ID's, and (one or more) columns + of data - only one of which will be read. 
+ + :param filename: Filename of CSV to load + :param measurement: Name of data column to be read + :return: 2D array of given variable. Index will be dates, + Columns will be the individual sites + """ +... +~~~ +{: .language-python} +Following this we need to change two lines of code, +the first being the CSV reading code, +and the second being the code which reorganises the dataset before it is returned: +~~~ +# catchment/models.py +... +def read_variable_from_csv(filename, measurement): +... + dataset = pd.read_csv(filename, usecols=['Date', 'Site', measurement]) +... + for site in dataset['Site'].unique(): + newdataset[site] = dataset[dataset['Site'] == site].set_index('Date')[measurement] +... +~~~ +{: .language-python} + + +Finally, within the `main` function of the controller we should add `args.measurements` as an argument: +~~~ +# catchment-analysis.py +... +def main(args): +... + for filename in in_files: + measurement_data = models.read_variable_from_csv(filename, args.measurements) +... +~~~ +{: .language-python} + +You can now test your new code, to ensure it works as expected: +~~~ +python catchment-analysis.py -m 'Rainfall (mm)' data/rain_data_2015-12.csv +~~~ +{: .language-bash} +![Rainfall daily metrics](../fig/rainfall_daily_metrics.png){: .image-with-shadow width="800px" } + +~~~ +python catchment-analysis.py -m 'pH continuous' data/river_data_2015-12.csv +~~~ +{: .language-bash} +![River pH daily metrics](../fig/pH_daily_metrics.png){: .image-with-shadow width="800px" } + +Note that we have to use quotation marks to +pass any strings which contain spaces or special characters, +so that they are properly read by the parser. + + + +### Adding a new View + +Now that we can select the data we require, +let's add a view that allows us to see the data for a single site. +First, we need to add the code for the view itself +and make sure our `Site` class has the necessary data - +including the ability to pass a list of measurements to the `__init__` method. +Note that your Site class may look very different now, +so adapt this example to fit what you have. + +~~~ python +# file: catchment/views.py + +... + +def display_measurement_record(site): + """Display each dataset for a single site.""" + print(site.name) + for measurement in site.measurements: + print(site.measurements[measurement].series) +~~~ +{: .language-python} + +~~~ python +# file: catchment/models.py + +... 
+ +class MeasurementSeries: + def __init__(self, series, name, units): + self.series = series + self.name = name + self.units = units + self.series.name = self.name + + def add_measurement(self, data): + self.series = pd.concat([self.series,data]) + self.series.name = self.name + + def __str__(self): + if self.units: + return f"{self.name} ({self.units})" + else: + return self.name + +class Location: + def __init__(self, name): + self.name = name + + def __str__(self): + return self.name + +class Site(Location): + def __init__(self,name): + super().__init__(name) + self.measurements = {} + + def add_measurement(self, measurement_id, data, units=None): + if measurement_id in self.measurements.keys(): + self.measurements[measurement_id].add_measurement(data) + + else: + self.measurements[measurement_id] = MeasurementSeries(data, measurement_id, units) + + @property + def last_measurements(self): + return pd.concat( + [self.measurements[key].series[-1:] for key in self.measurements.keys()], + axis=1).sort_index() + +~~~ +{: .language-python} + +Now we need to make sure people can call this view - +that means connecting it to the controller +and ensuring that there's a way to request this view when running the program. + +#### Adapting the Controller + +The changes we need to make here are that the `main` function +needs to be able to direct us to the view we've requested - +and we need to add to the command line interface - the controller - +the necessary data to drive the new view. + +As the argument parsing routines are getting more involved, we have moved these into a +single function (`parse_cli_arguments`), to make the script more readable. +~~~ +# file: catchment-analysis.py + +#!/usr/bin/env python3 +"""Software for managing measurement data for our catchment project.""" + +import argparse + +from catchment import models, views + + +def main(args): + """The MVC Controller of the patient data system. 
+ + The Controller is responsible for: + - selecting the necessary models and views for the current task + - passing data between models and views + """ + infiles = args.infiles + if not isinstance(infiles, list): + infiles = [args.infiles] + + for filename in in_files: + measurement_data = models.read_variable_from_csv(filename, arg.measurements) + + + ### MODIFIED START ### + if args.view == 'visualize': + view_data = {'daily sum': models.daily_total(measurement_data), + 'daily average': models.daily_mean(measurement_data), + 'daily max': models.daily_max(measurement_data), + 'daily min': models.daily_min(measurement_data)} + + views.visualize(view_data) + + elif args.view == 'record': + measurement_data = measurement_data[args.site] + site = models.Site(args.site) + site.add_measurement(arg.measurements, measurement_data) + + views.display_measurement_record(site) + ### MODIFIED END ### + + +def parse_cli_arguments(): + """Definitions and logic tests for the CLI argument parser""" + + parser = argparse.ArgumentParser( + description='A basic environmental data management system') + + req_group = parser.add_argument_group('required arguments') + + parser.add_argument( + 'infiles', + nargs = '+', + help = 'Input CSV(s) containing measurement data') + + req_group.add_argument( + '-m', '--measurements', + help = 'Name of measurement data series to load', + required = True) + + ### MODIFIED START ### + parser.add_argument( + '--view', + default = 'visualize', + choices = ['visualize', 'record'], + help = 'Which view should be used?') + + parser.add_argument( + '--site', + type = str, + default = None, + help = 'Which site should be displayed?') + ### MODIFIED END ### + + args = parser.parse_args() + + if args.view == 'record' and args.site is None: + parser.error("'record' --view requires that --site is set") + + return args + + +if __name__ == "__main__": + + args = parse_cli_arguments() + + main(args) +~~~ +{: .language-python} + +We've added two options to our command line interface here: +one to request a specific view (`--view`) +and one for the site ID that we want to lookup (`--site`). +Note that both are optional, +but have `default` values if they are not set. +For the view option, +the default is for the graphic `visualize` view, +and we have set a defined list of `choices` that users are allowed to specify. +For the site option the default value is `None`. +We have added an `if` statement after the arguments are parsed, +but before calling the `main` function, +to ensure that the site option is set if we are using the `record` view, +which will return an error using the `parser.error` function: +~~~ +python3 catchment-analysis.py --view record -m 'Rainfall (mm)' data/rain_data_2015-12.csv +~~~ +{: .language-bash} +~~~ +usage: catchment-analysis.py [-h] -m MEASUREMENTS [--view {visualize,record}] [--site SITE] infiles [infiles ...] +catchment-analysis.py: error: 'record' --view requires that --site is set +~~~ +{: .output} +Because we used the `parser.error` function, +the usage information for the command is given, +followed by the error message that we have added. + +We can now call our program with these extra arguments to see the record for a single site: + +~~~ +$ python3 catchment-analysis.py --view record --site FP35 -m 'Rainfall (mm)' data/rain_data_2015-12.csv +~~~ +{: .language-bash} + +~~~ +FP35 +2005-12-01 00:00:00 0.0 +2005-12-01 00:15:00 0.0 +2005-12-01 00:30:00 0.0 +2005-12-01 00:45:00 0.0 +2005-12-01 01:00:00 0.0 + ... 
+2005-12-31 22:45:00 0.2 +2005-12-31 23:00:00 0.0 +2005-12-31 23:15:00 0.2 +2005-12-31 23:30:00 0.2 +2005-12-31 23:45:00 0.0 +Name: Rainfall, Length: 2976, dtype: float64 +~~~ +{: .output} + + +For the full range of features that we have access to with `argparse` see the +[Python module documentation](https://docs.python.org/3/library/argparse.html?highlight=argparse#module-argparse). +Allowing the user to request a specific view like this is +a similar model to that used by the popular Python library Click - +if you find yourself needing to build more complex interfaces than this, +Click would be a good choice. +You can find more information in [Click's documentation](https://click.palletsprojects.com/). + + +> ## Additional Material +> +> Now that we've covered the basics of different programming paradigms +> and how we can integrate them into our multi-layer architecture, +> there are two optional extra episodes which you may find interesting. +> +> Both episodes cover the persistence layer of software architectures +> and methods of persistently storing data, but take different approaches. +> The episode on [persistence with JSON](../persistence) covers +> some more advanced concepts in Object Oriented Programming, while +> the episode on [databases](../databases) starts to build towards a true multilayer architecture, +> which would allow our software to handle much larger quantities of data. +{: .callout} + + +## Towards Collaborative Software Development + +Having looked at some theoretical aspects of software design, +we are now circling back to implementing our software design +and developing our software to satisfy the requirements collaboratively in a team. +At an intermediate level of software development, +there is a wealth of practices that could be used, +and applying suitable design and coding practices is what separates +an intermediate developer from someone who has just started coding. +The key for an intermediate developer is to balance these concerns +for each software project appropriately, +and employ design and development practices enough so that progress can be made. + +One practice that should always be considered, +and has been shown to be very effective in team-based software development, +is that of *code review*. +Code reviews help to ensure the 'good' coding standards are achieved +and maintained within a team by having multiple people +have a look and comment on key code changes to see how they fit within the codebase. +Such reviews check the correctness of the new code, test coverage, functionality changes, +and confirm that they follow the coding guides and best practices. +Let's have a look at some code review techniques available to us. 
From 2c35bfd902ee6a7e9797e01181046c574760ec07 Mon Sep 17 00:00:00 2001 From: Douglas Lowe <10961945+douglowe@users.noreply.github.com> Date: Wed, 6 Mar 2024 19:44:48 +0000 Subject: [PATCH 06/12] add example command, and expand on file access problem --- _extras/refactor-1-software-design.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/_extras/refactor-1-software-design.md b/_extras/refactor-1-software-design.md index bc7b96f9c..4941bb4a3 100644 --- a/_extras/refactor-1-software-design.md +++ b/_extras/refactor-1-software-design.md @@ -108,12 +108,15 @@ git checkout -b full-data-analysis This new feature enables user to pass a new command-line parameter `--full-data-analysis` causing the software to find the directory containing the first input data file (provided via command line parameter `infiles`) and invoke the data analysis over all the data files in that directory. -This bit of functionality is handled by `catchment-analysis.py` in the project root. +This bit of functionality is handled by `catchment-analysis.py` in the project root. E.g. +```bash +python catchment-analysis.py data/rain_data_small.csv --full-data-analysis +``` The new data analysis code is located in `compute_data.py` file within the `catchment` directory in a function called `analyse_data()`. This function loads all the data files for a given a directory path, then -calculates and compares standard deviation across all the data by day and finaly plots a graph. +calculates and compares standard deviation across all the data by day and finally plots a graph. > ## Exercise: Identifying How Code Can be Improved? > Critically examine the code in `analyse_data()` function in `compute_data.py` file. @@ -131,8 +134,8 @@ calculates and compares standard deviation across all the data by day and finaly >> the analysis itself. >> * **Hard to modify:** if you wanted to use the data for some other purpose and not just >> plotting the graph you would have to change the `data_analysis()` function. ->> * **Hard to modify or test:** it is always analysing a fixed set of CSV data files ->> stored on a disk. +>> * **Hard to modify or test:** it always analyses a fixed set of CSV data files +>> within whichever directory it accesses, not always the file that is given as an argument. >> * **Hard to modify:** it does not have any tests so we cannot be 100% confident the code does >> what it claims to do; any changes to the code may break something and it would be harder and >> more time-consuming to figure out what. From d2bfb95b3988560d7c1060e4ccb9226c20268498 Mon Sep 17 00:00:00 2001 From: Douglas Lowe <10961945+douglowe@users.noreply.github.com> Date: Wed, 6 Mar 2024 20:15:37 +0000 Subject: [PATCH 07/12] added mapreduce text and exercise --- _extras/refactor-2-code-refactoring.md | 70 ++++++++++++++++++++++++-- 1 file changed, 66 insertions(+), 4 deletions(-) diff --git a/_extras/refactor-2-code-refactoring.md b/_extras/refactor-2-code-refactoring.md index 0db53b64a..5875737a9 100644 --- a/_extras/refactor-2-code-refactoring.md +++ b/_extras/refactor-2-code-refactoring.md @@ -107,7 +107,7 @@ the tests at all. 
> > Remember to run the test using `python -m pytest` from the project base directory: > ```bash -> python -m pytest tests/test_analyse_data.py +> python -m pytest tests/test_compute_data.py > ``` > >> ## Hint @@ -214,10 +214,12 @@ be harder to test but, when simplified like this, may only require a handful of >> The analysis code will be refactored into a separate function that may look something like: >> ```python >>def compute_standard_deviation_by_day(data): ->> means_by_day = map(models.daily_mean, data) ->> means_by_day_matrix = pd.concat(means_by_day) +>> daily_std_list = [] +>> for dataset in data: +>> daily_std = dataset.groupby(dataset.index.date).std() +>> daily_std_list.append(daily_std) >> ->> daily_standard_deviation = pd.DataFrame(means_by_day_matrix.std(axis=1), columns=['std']) +>> daily_standard_deviation = pd.concat(daily_std_list) >> return daily_standard_deviation >> ``` >> The `analyse_data()` function now calls the `compute_standard_deviation_by_day()` function, @@ -247,6 +249,66 @@ be harder to test but, when simplified like this, may only require a handful of > {: .solution} {: .challenge} +### MapReduce Data Processing Approach + +When working with data you will often find that you need to +apply a transformation to each datapoint of a dataset +and then perform some aggregation across the whole dataset. +One instance of this data processing approach is known as MapReduce +and is applied when processing (but not limited to) Big Data, +e.g. using tools such as [Spark](https://en.wikipedia.org/wiki/Apache_Spark) +or [Hadoop](https://hadoop.apache.org/). +The name MapReduce comes from applying an operation to (mapping) each value in a dataset, +then performing a reduction operation which +collects/aggregates all the individual results together to produce a single result. +MapReduce relies heavily on composability and parallelisability of functional programming - +both map and reduce can be done in parallel and on smaller subsets of data, +before aggregating all intermediate results into the final result. + +> ## Exercise: Mapping +> `map(f, C)` is a function that takes another function `f()` +> and a collection `C` of data items as inputs. +> Calling `map(f, C)` applies the function `f(x)` to every data item `x` in a collection `C` +> and returns the resulting values as a new collection of the same size. +> +> First identify a line of code in the `analyse_data` function which uses the `map` function. +>> ## Solution +>> The `map` function is used with the `read_variables_from_csv` function in the `catchment/models.py` module. +>> It creates a collection of dataframes containing the data within files defined in the list `data_file_paths`: +>> ```python +>> data = map(models.read_variable_from_csv, data_file_paths) +>> ``` +> {: .solution} +> +> Now create a pure function, `daily_std`, to calculate the standard deviation by day for any dataframe. +> This can take form similar to the `daily_mean` and `daily_max` functions in the `catchment/models.py` file. +> +> Then replace the `for` loop below, that is in your `compute_standard_deviation_by_day` function, +> with a `map()` function that uses the `daily_std` function to calculate the daily standard +> deviation. 
+> ```python +> daily_std_list = [] +> for dataset in data: +> daily_std = dataset.groupby(dataset.index.date).std() +> daily_std_list.append(daily_std) +> ``` +>> ## Solution +>> The final functions could look like: +>> ```python +>> def daily_std(data): +>> return data.groupby(data.index.date).std() +>> +>> +>> def compute_standard_deviation_by_day(data): +>> daily_std_list = map(daily_std, data) +>> +>> daily_standard_deviation = pd.concat(daily_std_list) +>> return daily_standard_deviation +>> ``` +>> +> {: .solution} +{: .challenge} + ### Testing Pure Functions Now we have our analysis implemented as a pure function, we can write tests that cover From 07b88dc9c31f064e09a7497458a1d6ce0a77a1f8 Mon Sep 17 00:00:00 2001 From: Douglas Lowe <10961945+douglowe@users.noreply.github.com> Date: Wed, 6 Mar 2024 20:43:53 +0000 Subject: [PATCH 08/12] updated regression test and added pure function test --- _extras/refactor-2-code-refactoring.md | 80 ++++++++++++++++++++------ 1 file changed, 61 insertions(+), 19 deletions(-) diff --git a/_extras/refactor-2-code-refactoring.md b/_extras/refactor-2-code-refactoring.md index 5875737a9..b1c38b30e 100644 --- a/_extras/refactor-2-code-refactoring.md +++ b/_extras/refactor-2-code-refactoring.md @@ -136,14 +136,37 @@ the tests at all. >> from catchment.compute_data import analyse_data >> path = Path.cwd() / "data" >> result = analyse_data(path) ->> expected_output = [[0.09133463], [0.17383042], [0.00147314], [0.00147314], ->> [0. ], [0.00294628], [0.03682848], [0.00883883], ->> [0.00147314], [0.169411 ], [0.00147314], [0. ], ->> [0.00147314], [0. ], [0. ], [0. ], ->> [0. ], [0.00294628], [0.00147314], [0.00147314], ->> [0.00147314], [0.00147314], [0. ], [ np.nan], ->> [0.00147314], [0.00147314], [0.00147314], [0.00147314], ->> [0.01473139], [0.01178511], [0.02209709]] +>> expected_output = [ [0. , 0.18801829], +>> [0.10978448, 0.43107373], +>> [0.06066156, 0.0699624 ], +>> [0. , 0.02041241], +>> [0. , 0. ], +>> [0. , 0.02871518], +>> [0. , 0.17227833], +>> [0. , 0.04866643], +>> [0. , 0.02041241], +>> [0.88952727, 0. ], +>> [0. , 0.02041241], +>> [0. , 0. ], +>> [0.02041241, 0. ], +>> [0. , 0. ], +>> [0. , 0. ], +>> [0. , 0. ], +>> [0. , 0. ], +>> [0.0349812 , 0.02041241], +>> [0.02871518, 0.02041241], +>> [0.02041241, 0. ], +>> [0.02041241, 0. ], +>> [0. , 0.02041241], +>> [0. , 0. ], +>> [0. , np.nan], +>> [0.02041241, 0. ], +>> [0. , 0.02041241], +>> [0. , 0.02041241], +>> [0.02041241, 0. ], +>> [0.13449059, 0. ], +>> [0.18285024, 0.19707288], +>> [0.19176008, 0.13915472]] >> npt.assert_array_almost_equal(result, expected_output) >> ``` >> @@ -324,22 +347,41 @@ from CSV to JSON, the bulk of the tests need not be updated > ## Exercise: Testing a Pure Function > Add tests for `compute_standard_deviation_by_data()` that check for situations -> when there is only one file with multiple rows, -> multiple files with one row, and any other cases you can think of that should be tested. +> when there is only one file with multiple sites, +> multiple files with one site, and any other cases you can think of that should be tested. 
>> ## Solution >> You might have thought of more tests, but we can easily extend the test by parametrizing >> with more inputs and expected outputs: >> ```python ->>@pytest.mark.parametrize('data,expected_output', [ ->> ([[[0, 1, 0], [0, 2, 0]]], [0, 0, 0]), ->> ([[[0, 2, 0]], [[0, 1, 0]]], [0, math.sqrt(0.25), 0]), ->> ([[[0, 1, 0], [0, 2, 0]], [[0, 1, 0], [0, 2, 0]]], [0, 0, 0]) ->>], ->>ids=['Two patients in same file', 'Two patients in different files', 'Two identical patients in two different files']) +>>@pytest.mark.parametrize( +>> "data, expected_output", +>> [ +>> ( +>> [pd.DataFrame(data=[ [1.0, 0.0], [3.0, 4.0], [5.0, 8.0] ], +>> index=[ pd.to_datetime('2000-01-01 01:00'), +>> pd.to_datetime('2000-01-01 02:00'), +>> pd.to_datetime('2000-01-01 03:00') ], +>> columns=[ 'A', 'B' ])], +>> [ [2.0, 4.0] ] +>> ), +>> ( +>> [pd.DataFrame(data=[ 1.0, 3.0, 5.0 ], +>> index=[ pd.to_datetime('2000-01-01 01:00'), +>> pd.to_datetime('2000-01-01 02:00'), +>> pd.to_datetime('2000-01-01 03:00') ], +>> columns=['A']), +>> pd.DataFrame(data=[ 0.0, 4.0, 8.0 ], +>> index=[ pd.to_datetime('2000-01-01 01:00'), +>> pd.to_datetime('2000-01-01 02:00'), +>> pd.to_datetime('2000-01-01 03:00') ], +>> columns=['B']) ], +>> [ [2.0, 4.0] ] +>> ) +>> ], ids=["two datasets in same dataframe", "two datasets in two different dataframes"]) >>def test_compute_standard_deviation_by_day(data, expected_output): ->> from inflammation.compute_data import compute_standard_deviation_by_data +>> from catchment.compute_data import compute_standard_deviation_by_day >> ->> result = compute_standard_deviation_by_data(data) +>> result = compute_standard_deviation_by_day(data) >> npt.assert_array_almost_equal(result, expected_output) ``` > {: .solution} @@ -351,7 +393,7 @@ from CSV to JSON, the bulk of the tests need not be updated > Some programming languages, such as Haskell or Lisp, support writing pure functional code only. > Other languages, such as Python, Java, C++, allow mixing **functional** and **procedural** > programming paradigms. -> Read more in the [extra episode on functional programming](/functional-programming/index.html) +> Read more in the [extra episode on functional programming](/34-functional-programming/index.html) > and when it can be very useful to switch to this paradigm > (e.g. to employ MapReduce approach for data processing). {: .callout} From fbc4e9915095e99f3e1993ed24fe1250c617cab4 Mon Sep 17 00:00:00 2001 From: Douglas Lowe <10961945+douglowe@users.noreply.github.com> Date: Fri, 8 Mar 2024 00:11:12 +0000 Subject: [PATCH 09/12] code abstractions adapted for catchment code --- _extras/refactor-3-code-abstractions.md | 110 +++++++++++++----------- 1 file changed, 62 insertions(+), 48 deletions(-) diff --git a/_extras/refactor-3-code-abstractions.md b/_extras/refactor-3-code-abstractions.md index 409f8312b..ff133f9d4 100644 --- a/_extras/refactor-3-code-abstractions.md +++ b/_extras/refactor-3-code-abstractions.md @@ -41,22 +41,22 @@ let's decouple the data loading into a separate function. > ## Exercise: Decouple Data Loading from Data Analysis > Separate out the data loading functionality from `analyse_data()` into a new function -> `load_inflammation_data()` that returns all the files to load. +> `load_catchment_data()` that returns all the files to load. 
>> ## Solution ->> The new function `load_inflammation_data()` that reads all the data into the format needed +>> The new function `load_catchment_data()` that reads all the data into the format needed >> for the analysis should look something like: >> ```python >> def load_inflammation_data(dir_path): ->> data_file_paths = glob.glob(os.path.join(dir_path, 'inflammation*.csv')) +>> data_file_paths = glob.glob(os.path.join(dir_path, 'rain_data_2015*.csv')) >> if len(data_file_paths) == 0: ->> raise ValueError(f"No inflammation csv's found in path {dir_path}") +>> raise ValueError('No CSV files found in the data directory') >> data = map(models.load_csv, data_file_paths) >> return list(data) >> ``` >> This function can now be used in the analysis as follows: >> ```python >> def analyse_data(data_dir): ->> data = load_inflammation_data(data_dir) +>> data = load_catchment_data(data_dir) >> daily_standard_deviation = compute_standard_deviation_by_data(data) >> ... >> ``` @@ -155,7 +155,7 @@ In addition, implementation of the method `get_area()` is hidden too (abstractio {: .callout} > ## Exercise: Use Classes to Abstract out Data Loading -> Declare a new class `CSVDataSource` that contains the `load_inflammation_data` function +> Declare a new class `CSVDataSource` that contains the `load_catchment_data` function > we wrote in the previous exercise as a method of this class. > The directory path where to load the files from should be passed in the class' constructor method. > Finally, construct an instance of the class `CSVDataSource` outside the statistical @@ -164,14 +164,14 @@ In addition, implementation of the method `get_area()` is hidden too (abstractio >> At the end of this exercise, the code in the `analyse_data()` function should look like: >> ```python >> def analyse_data(data_source): ->> data = data_source.load_inflammation_data() +>> data = data_source.load_catchment_data() >> daily_standard_deviation = compute_standard_deviation_by_data(data) >> ... >> ``` >> The controller code should look like: >> ```python ->> data_source = CSVDataSource(os.path.dirname(InFiles[0])) ->> analyse_data(data_source) +>> data_source = compute_data.CSVDataSource(os.path.dirname(InFiles[0])) +>> compute_data.analyse_data(data_source) >> ``` > {: .solution} >> ## Solution @@ -180,16 +180,16 @@ In addition, implementation of the method `get_area()` is hidden too (abstractio >> ```python >> class CSVDataSource: >> """ ->> Loads all the inflammation CSV files within a specified directory. +>> Loads all the catchment CSV files within a specified directory. 
>> """ >> def __init__(self, dir_path): >> self.dir_path = dir_path >> ->> def load_inflammation_data(self): ->> data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.csv')) +>> def load_catchment_data(self): +>> data_file_paths = glob.glob(os.path.join(self.dir_path, 'rain_data_2015*.csv')) >> if len(data_file_paths) == 0: ->> raise ValueError(f"No inflammation CSV files found in path {self.dir_path}") ->> data = map(models.load_csv, data_file_paths) +>> raise ValueError('No CSV files found in the data directory') +>> data = map(models.read_variable_from_csv, data_file_paths) >> return list(data) >> ``` >> In the controller, we create an instance of CSVDataSource and pass it @@ -200,16 +200,16 @@ In addition, implementation of the method `get_area()` is hidden too (abstractio >> analyse_data(data_source) >> ``` >> The `analyse_data()` function is modified to receive any data source object (that implements ->> the `load_inflammation_data()` method) as a parameter. +>> the `load_catchment_data()` method) as a parameter. >> ```python >> def analyse_data(data_source): ->> data = data_source.load_inflammation_data() +>> data = data_source.load_catchment_data() >> daily_standard_deviation = compute_standard_deviation_by_data(data) >> ... >> ``` >> We have now fully decoupled the reading of the data from the statistical analysis and >> the analysis is not fixed to reading from a directory of CSV files. Indeed, we can pass various ->> data sources to this function now, as long as they implement the `load_inflammation_data()` +>> data sources to this function now, as long as they implement the `load_catchment_data()` >> method. >> >> While the overall behaviour of the code and its results are unchanged, @@ -218,11 +218,11 @@ In addition, implementation of the method `get_area()` is hidden too (abstractio >> ```python >> ... >> def test_compute_data(): ->> from inflammation.compute_data import analyse_data +>> from catchment.compute_data import analyse_data, CSVDataSource >> path = Path.cwd() / "../data" >> data_source = CSVDataSource(path) >> result = analyse_data(data_source) ->> expected_output = [0.,0.22510286,0.18157299,0.1264423,0.9495481,0.27118211 +>> expected_output = [ [0. , 0.18801829], >> ... >> ``` > {: .solution} @@ -255,9 +255,9 @@ on it and it will return a number representing its surface area. > Think about what functions `analyse_data()` needs to be able to call to perform its duty, > what parameters they need and what they return. >> ## Solution ->> The interface is the `load_inflammation_data()` method, which takes no parameters and ->> returns a list where each entry is a 2D array of patient inflammation data (read from some -> data source). +>> The interface is the `load_catchment_data()` method, which takes no parameters and +>> returns a list where each entry is a 2D array of catchment measurement data (read from some +>> data source). >> >> Any object passed into `analyse_data()` should conform to this interface. > {: .solution} @@ -320,16 +320,22 @@ data sources with no extra work. > ## Exercise: Add an Additional DataSource > Create another class that supports loading patient data from JSON files, with the -> appropriate `load_inflammation_data()` method. +> appropriate `load_catchment_data()` method. 
> There is a function in `models.py` that loads from JSON in the following format: > ```json -> [ -> { -> "observations": [0, 1] -> }, -> { -> "observations": [0, 2] -> } +>[ +> { +> "Site": "FP35", +> "Site Name": "Lower Wraxall Farm", +> "Date": "01/12/2008 23:00", +> "Rainfall (mm)": 0.0 +> }, +> { +> "Site": "FP35", +> "Site Name": "Lower Wraxall Farm", +> "Date": "01/12/2008 23:15", +> "Rainfall (mm)": 0.0 +> } > ] > ``` > Finally, at run time construct an appropriate instance based on the file extension. @@ -337,18 +343,18 @@ data sources with no extra work. >> The new class could look something like: >> ```python >> class JSONDataSource: ->> """ ->> Loads patient data with inflammation values from JSON files within a specified folder. ->> """ ->> def __init__(self, dir_path): ->> self.dir_path = dir_path +>> """ +>> Loads patient data with catchment values from JSON files within a specified folder. +>> """ +>> def __init__(self, dir_path): +>> self.dir_path = dir_path >> ->> def load_inflammation_data(self): ->> data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.json')) ->> if len(data_file_paths) == 0: ->> raise ValueError(f"No inflammation JSON's found in path {self.dir_path}") ->> data = map(models.load_json, data_file_paths) ->> return list(data) +>> def load_catchment_data(self): +>> data_file_paths = glob.glob(os.path.join(self.dir_path, 'rain_data_2015*.json')) +>> if len(data_file_paths) == 0: +>> raise ValueError('No JSON files found in the data directory') +>> data = map(models.load_json, data_file_paths) +>> return list(data) >> ``` >> Additionally, in the controller will need to select the appropriate DataSource to >> provide to the analysis: @@ -400,7 +406,7 @@ Now whenever you call `mock_version.method_to_mock()` the return value will be ` > from unittest.mock import Mock > > def test_compute_data_mock_source(): -> from inflammation.compute_data import analyse_data +> from catchment.compute_data import analyse_data > data_source = Mock() > > # TODO: configure data_source mock @@ -419,13 +425,21 @@ Now whenever you call `mock_version.method_to_mock()` the return value will be ` >> from unittest.mock import Mock >> >> def test_compute_data_mock_source(): ->> from inflammation.compute_data import analyse_data ->> data_source = Mock() ->> data_source.load_inflammation_data.return_value = [[[0, 2, 0]], ->> [[0, 1, 0]]] +>> from catchment.compute_data import analyse_data +>> data_source = Mock() +>> +>> data_source.load_catchment_data.return_value = [pd.DataFrame( +>> data=[[1.0, 1.0], +>> [2.0, 1.0], +>> [4.0, 2.0]], +>> index=[pd.to_datetime('2000-01-01 01:00'), +>> pd.to_datetime('2000-01-01 02:00'), +>> pd.to_datetime('2000-01-01 03:00')], +>> columns=['A', 'B'] +>> )] >> ->> result = analyse_data(data_source) ->> npt.assert_array_almost_equal(result, [0, math.sqrt(0.25) ,0]) +>> result = analyse_data(data_source) +>> npt.assert_array_almost_equal(result, [[1.527525, 0.57735 ]]) >> ``` > {: .solution} {: .challenge} From 92917a85f9770c8b91a3b361e346f662f3387e3b Mon Sep 17 00:00:00 2001 From: Douglas Lowe <10961945+douglowe@users.noreply.github.com> Date: Sun, 10 Mar 2024 10:12:32 +0000 Subject: [PATCH 10/12] remove mapreduce text and expand map example --- _extras/refactor-2-code-refactoring.md | 38 ++++++++++++-------------- 1 file changed, 18 insertions(+), 20 deletions(-) diff --git a/_extras/refactor-2-code-refactoring.md b/_extras/refactor-2-code-refactoring.md index b1c38b30e..7ad37673f 100644 --- a/_extras/refactor-2-code-refactoring.md +++ 
b/_extras/refactor-2-code-refactoring.md @@ -272,29 +272,27 @@ be harder to test but, when simplified like this, may only require a handful of > {: .solution} {: .challenge} -### MapReduce Data Processing Approach - -When working with data you will often find that you need to -apply a transformation to each datapoint of a dataset -and then perform some aggregation across the whole dataset. -One instance of this data processing approach is known as MapReduce -and is applied when processing (but not limited to) Big Data, -e.g. using tools such as [Spark](https://en.wikipedia.org/wiki/Apache_Spark) -or [Hadoop](https://hadoop.apache.org/). -The name MapReduce comes from applying an operation to (mapping) each value in a dataset, -then performing a reduction operation which -collects/aggregates all the individual results together to produce a single result. -MapReduce relies heavily on composability and parallelisability of functional programming - -both map and reduce can be done in parallel and on smaller subsets of data, -before aggregating all intermediate results into the final result. - -> ## Exercise: Mapping +> ## Mapping > `map(f, C)` is a function that takes another function `f()` > and a collection `C` of data items as inputs. > Calling `map(f, C)` applies the function `f(x)` to every data item `x` in a collection `C` > and returns the resulting values as a new collection of the same size. > -> First identify a line of code in the `analyse_data` function which uses the `map` function. +> This is a simple mapping that takes a list of names and +> returns a list of the lengths of those names using the built-in function `len()`: +> ```python +> name_lengths = map(len, ["Mary", "Isla", "Sam"]) +> print(list(name_lengths)) +> ``` +> ```output +> [4, 4, 3] +> ``` +> For more information on mapping functions, and how they can be combined with reduce +> functions, see the [Functional Programming](/34-functional-programming/index.html) episode. +{: .callout} + +> ## Exercise: Mapping +> Identify a line of code in the `analyse_data` function which uses the `map` function. >> ## Solution >> The `map` function is used with the `read_variables_from_csv` function in the `catchment/models.py` module. >> It creates a collection of dataframes containing the data within files defined in the list `data_file_paths`: @@ -304,7 +302,7 @@ before aggregating all intermediate results into the final result. > {: .solution} > > Now create a pure function, `daily_std`, to calculate the standard deviation by day for any dataframe. -> This can take form similar to the `daily_mean` and `daily_max` functions in the `catchment/models.py` file. +> This can take a similar form to the `daily_mean` and `daily_max` functions in the `catchment/models.py` file. > > Then replace the `for` loop below, that is in your `compute_standard_deviation_by_day` function, > with a `map()` function that uses the `daily_std` function to calculate the daily standard @@ -346,7 +344,7 @@ the test is correct from CSV to JSON, the bulk of the tests need not be updated > ## Exercise: Testing a Pure Function -> Add tests for `compute_standard_deviation_by_data()` that check for situations +> Add tests for `compute_standard_deviation_by_day()` that check for situations > when there is only one file with multiple sites, > multiple files with one site, and any other cases you can think of that should be tested. 
>> ## Solution From 37623fadc1b5b6d9de00ba81e9a75a0a454f1cbc Mon Sep 17 00:00:00 2001 From: Douglas Lowe <10961945+douglowe@users.noreply.github.com> Date: Sun, 10 Mar 2024 10:13:22 +0000 Subject: [PATCH 11/12] merge new material into MVC episode, remove OO related content --- _extras/refactor-4-architecture-revisited.md | 358 ++++++------------- 1 file changed, 115 insertions(+), 243 deletions(-) diff --git a/_extras/refactor-4-architecture-revisited.md b/_extras/refactor-4-architecture-revisited.md index 82524d23b..660ddda11 100644 --- a/_extras/refactor-4-architecture-revisited.md +++ b/_extras/refactor-4-architecture-revisited.md @@ -25,6 +25,20 @@ We've been developing our software using the **Model-View-Controller** (MVC) arc but, as we have seen, MVC is just one of the common architectural patterns and is not the only choice we could have made. +### Separation of Responsibilities + +Separation of responsibilities is important when designing software architectures +in order to reduce the code's complexity and increase its maintainability. +Note, however, there are limits to everything - +and MVC architecture is no exception. +Controller often transcends into Model and View +and a clear separation is sometimes difficult to maintain. +For example, the Command Line Interface provides both the View +(what user sees and how they interact with the command line) +and the Controller (invoking of a command) aspects of a CLI application. +In Web applications, Controller often manipulates the data (received from the Model) +before displaying it to the user or passing it from the user to the Model. + There are many variants of an MVC-like pattern (such as [Model-View-Presenter](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93presenter) (MVP), [Model-View-Viewmodel](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93viewmodel) (MVVM), etc.), @@ -33,10 +47,106 @@ What really matters is that we are making decisions about the architecture of ou that suit the way in which we expect to use it. We should reuse these established ideas where we can, but we don't need to stick to them exactly. -In this episode we'll be taking our Object Oriented code from the previous episode -and integrating it into our existing MVC pattern. -But first we will explain some features of -the Controller (`catchment-analysis.py`) component of our architecture. +The key thing to take away is the distinction between the Model and the View code, while +the View and the Controller can be more or less coupled together (e.g. the code that specifies +there is a button on the screen, might be the same code that specifies what that button does). +The View may be hard to test, or use special libraries to draw the UI, but should not contain any +complex logic, and is really just a presentation layer on top of the Model. +The Model, conversely, should not care how the data is displayed. +For example, the View may present dates as "Monday 24th July 2023", +but the Model stores it using a `Date` object rather than its string representation. + +## Our Project's Architecture (Revisited) + +Recall that in our software project, the **Controller** module is in `catchment-analysis.py`, +and the View and Model modules are contained in +`catchment/views.py` and `catchment/models.py`, respectively. +Data underlying the Model is contained within the directory `data`. 
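+
+For orientation, the parts of the project tree discussed in this episode look roughly like this
+(the top-level name will match your own checkout, and it will contain other files as well):
+
+~~~
+python-intermediate-rivercatchment/
+├── catchment-analysis.py      # Controller: the command-line interface
+├── catchment/
+│   ├── models.py              # Model: data loading and statistics
+│   ├── views.py               # View: graphs and other output
+│   └── compute_data.py        # new analysis code on the full-data-analysis branch
+└── data/                      # CSV measurement data used by the Model
+~~~
+{: .output}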
+
+Looking at the code in the branch `full-data-analysis` (where we should be currently located),
+we can see that the new code was added in a separate script `catchment/compute_data.py` and
+contains a mix of Model, View and Controller code.
+
+> ## Exercise: Identify Model, View and Controller Parts of the Code
+> Looking at the code inside `compute_data.py`, what parts could be considered
+> Model, View and Controller code?
+>
+>> ## Solution
+>> * Computing the standard deviation belongs to Model.
+>> * Reading the data from CSV files also belongs to Model.
+>> * Displaying the output as a graph is View.
+>> * The logic that processes the supplied files is Controller.
+> {: .solution}
{: .challenge}
+
+Within the Model further separations make sense.
+For example, as we did before, separating out the impure code that interacts with
+the file system from the pure calculations helps with readability and testability.
+Nevertheless, the MVC architectural pattern is a great starting point when thinking about
+how you should structure your code.
+
+> ## Exercise: Split out the Model, View and Controller Code
+> Refactor the `analyse_data()` function so that the Model, View and Controller code
+> we identified in the previous exercise is moved to appropriate modules.
+>> ## Solution
+>> The idea here is for the `analyse_data()` function not to have any "view" considerations.
+>> That is, it should just compute and return the data and
+>> should be located in `catchment/models.py`.
+>>
+>> ```python
+>> def analyse_data(data_source):
+>>     """Calculate the standard deviation by day between datasets.
+>>     Gets all the measurement data from the CSV files in the data directory,
+>>     works out the mean for each day, and then returns the standard deviation
+>>     of these means.
+>>     """
+>>     data = data_source.load_catchment_data()
+>>     daily_standard_deviation = compute_standard_deviation_by_data(data)
+>>
+>>     return daily_standard_deviation
+>> ```
+>> There can be a separate bit of code in the Controller `catchment-analysis.py`
+>> that chooses how data should be presented, e.g. as a graph:
+>>
+>> ```python
+>> if args.full_data_analysis:
+>>     _, extension = os.path.splitext(InFiles[0])
+>>     if extension == '.json':
+>>         data_source = JSONDataSource(os.path.dirname(InFiles[0]))
+>>     elif extension == '.csv':
+>>         data_source = CSVDataSource(os.path.dirname(InFiles[0]))
+>>     else:
+>>         raise ValueError(f'Unsupported file format: {extension}')
+>>     data_result = analyse_data(data_source)
+>>     graph_data = {
+>>         'daily standard deviation': data_result,
+>>     }
+>>     views.visualize(graph_data)
+>>     return
+>> ```
+>> Note that this is, more or less, the change we made when writing our regression test.
+>> This demonstrates that splitting up Model code from View code can
+>> immediately make your code much more testable.
+>> Ensure you re-run the regression test to check this refactoring has not
+>> changed the output of `analyse_data()`.
+> {: .solution}
{: .challenge}
+
+At this point, you have refactored and tested all the code on branch `full-data-analysis`
+and it is working as expected. The branch is ready to be incorporated into `develop`
+and then, later on, `main`. Both branches may also have been changed by other developers
+working on the code at the same time, so make sure to update accordingly and resolve any conflicts.
+
+~~~
+$ git switch develop
+$ git merge full-data-analysis
+~~~
+{: .language-bash}
+
+Let's now have a closer look at our Controller, and at how we can handle command line arguments
+in Python (which is something you may find yourself doing often if you need to run the code
+from a command line tool).
+
 ### Controller file structure
@@ -421,244 +531,6 @@ so that they are properly read by the parser.
-
-
-### Adding a new View
-
-Now that we can select the data we require,
-let's add a view that allows us to see the data for a single site.
-First, we need to add the code for the view itself
-and make sure our `Site` class has the necessary data -
-including the ability to pass a list of measurements to the `__init__` method.
-Note that your Site class may look very different now,
-so adapt this example to fit what you have.
-
-~~~ python
-# file: catchment/views.py
-
-...
-
-def display_measurement_record(site):
-    """Display each dataset for a single site."""
-    print(site.name)
-    for measurement in site.measurements:
-        print(site.measurements[measurement].series)
-~~~
-{: .language-python}
-
-~~~ python
-# file: catchment/models.py
-
-...
-
-class MeasurementSeries:
-    def __init__(self, series, name, units):
-        self.series = series
-        self.name = name
-        self.units = units
-        self.series.name = self.name
-
-    def add_measurement(self, data):
-        self.series = pd.concat([self.series,data])
-        self.series.name = self.name
-
-    def __str__(self):
-        if self.units:
-            return f"{self.name} ({self.units})"
-        else:
-            return self.name
-
-class Location:
-    def __init__(self, name):
-        self.name = name
-
-    def __str__(self):
-        return self.name
-
-class Site(Location):
-    def __init__(self,name):
-        super().__init__(name)
-        self.measurements = {}
-
-    def add_measurement(self, measurement_id, data, units=None):
-        if measurement_id in self.measurements.keys():
-            self.measurements[measurement_id].add_measurement(data)
-
-        else:
-            self.measurements[measurement_id] = MeasurementSeries(data, measurement_id, units)
-
-    @property
-    def last_measurements(self):
-        return pd.concat(
-            [self.measurements[key].series[-1:] for key in self.measurements.keys()],
-            axis=1).sort_index()
-
-~~~
-{: .language-python}
-
-Now we need to make sure people can call this view -
-that means connecting it to the controller
-and ensuring that there's a way to request this view when running the program.
-
-#### Adapting the Controller
-
-The changes we need to make here are that the `main` function
-needs to be able to direct us to the view we've requested -
-and we need to add to the command line interface - the controller -
-the necessary data to drive the new view.
-
-As the argument parsing routines are getting more involved, we have moved these into a
-single function (`parse_cli_arguments`), to make the script more readable.
-~~~
-# file: catchment-analysis.py
-
-#!/usr/bin/env python3
-"""Software for managing measurement data for our catchment project."""
-
-import argparse
-
-from catchment import models, views
-
-
-def main(args):
-    """The MVC Controller of the patient data system.
- - The Controller is responsible for: - - selecting the necessary models and views for the current task - - passing data between models and views - """ - infiles = args.infiles - if not isinstance(infiles, list): - infiles = [args.infiles] - - for filename in in_files: - measurement_data = models.read_variable_from_csv(filename, arg.measurements) - - - ### MODIFIED START ### - if args.view == 'visualize': - view_data = {'daily sum': models.daily_total(measurement_data), - 'daily average': models.daily_mean(measurement_data), - 'daily max': models.daily_max(measurement_data), - 'daily min': models.daily_min(measurement_data)} - - views.visualize(view_data) - - elif args.view == 'record': - measurement_data = measurement_data[args.site] - site = models.Site(args.site) - site.add_measurement(arg.measurements, measurement_data) - - views.display_measurement_record(site) - ### MODIFIED END ### - - -def parse_cli_arguments(): - """Definitions and logic tests for the CLI argument parser""" - - parser = argparse.ArgumentParser( - description='A basic environmental data management system') - - req_group = parser.add_argument_group('required arguments') - - parser.add_argument( - 'infiles', - nargs = '+', - help = 'Input CSV(s) containing measurement data') - - req_group.add_argument( - '-m', '--measurements', - help = 'Name of measurement data series to load', - required = True) - - ### MODIFIED START ### - parser.add_argument( - '--view', - default = 'visualize', - choices = ['visualize', 'record'], - help = 'Which view should be used?') - - parser.add_argument( - '--site', - type = str, - default = None, - help = 'Which site should be displayed?') - ### MODIFIED END ### - - args = parser.parse_args() - - if args.view == 'record' and args.site is None: - parser.error("'record' --view requires that --site is set") - - return args - - -if __name__ == "__main__": - - args = parse_cli_arguments() - - main(args) -~~~ -{: .language-python} - -We've added two options to our command line interface here: -one to request a specific view (`--view`) -and one for the site ID that we want to lookup (`--site`). -Note that both are optional, -but have `default` values if they are not set. -For the view option, -the default is for the graphic `visualize` view, -and we have set a defined list of `choices` that users are allowed to specify. -For the site option the default value is `None`. -We have added an `if` statement after the arguments are parsed, -but before calling the `main` function, -to ensure that the site option is set if we are using the `record` view, -which will return an error using the `parser.error` function: -~~~ -python3 catchment-analysis.py --view record -m 'Rainfall (mm)' data/rain_data_2015-12.csv -~~~ -{: .language-bash} -~~~ -usage: catchment-analysis.py [-h] -m MEASUREMENTS [--view {visualize,record}] [--site SITE] infiles [infiles ...] -catchment-analysis.py: error: 'record' --view requires that --site is set -~~~ -{: .output} -Because we used the `parser.error` function, -the usage information for the command is given, -followed by the error message that we have added. - -We can now call our program with these extra arguments to see the record for a single site: - -~~~ -$ python3 catchment-analysis.py --view record --site FP35 -m 'Rainfall (mm)' data/rain_data_2015-12.csv -~~~ -{: .language-bash} - -~~~ -FP35 -2005-12-01 00:00:00 0.0 -2005-12-01 00:15:00 0.0 -2005-12-01 00:30:00 0.0 -2005-12-01 00:45:00 0.0 -2005-12-01 01:00:00 0.0 - ... 
-2005-12-31 22:45:00 0.2 -2005-12-31 23:00:00 0.0 -2005-12-31 23:15:00 0.2 -2005-12-31 23:30:00 0.2 -2005-12-31 23:45:00 0.0 -Name: Rainfall, Length: 2976, dtype: float64 -~~~ -{: .output} - - -For the full range of features that we have access to with `argparse` see the -[Python module documentation](https://docs.python.org/3/library/argparse.html?highlight=argparse#module-argparse). -Allowing the user to request a specific view like this is -a similar model to that used by the popular Python library Click - -if you find yourself needing to build more complex interfaces than this, -Click would be a good choice. -You can find more information in [Click's documentation](https://click.palletsprojects.com/). - - > ## Additional Material > > Now that we've covered the basics of different programming paradigms @@ -695,4 +567,4 @@ and maintained within a team by having multiple people have a look and comment on key code changes to see how they fit within the codebase. Such reviews check the correctness of the new code, test coverage, functionality changes, and confirm that they follow the coding guides and best practices. -Let's have a look at some code review techniques available to us. +In the following episodes we will have a look at some code review techniques available to us. From cf20cd20648b51967e0415f345bc13b8a8c243bb Mon Sep 17 00:00:00 2001 From: Douglas Lowe <10961945+douglowe@users.noreply.github.com> Date: Sun, 10 Mar 2024 18:22:35 +0000 Subject: [PATCH 12/12] remove patient data reference --- _extras/refactor-3-code-abstractions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_extras/refactor-3-code-abstractions.md b/_extras/refactor-3-code-abstractions.md index ff133f9d4..4a3996256 100644 --- a/_extras/refactor-3-code-abstractions.md +++ b/_extras/refactor-3-code-abstractions.md @@ -319,7 +319,7 @@ Conversely, if we wanted to write a new analysis function, we could support any data sources with no extra work. > ## Exercise: Add an Additional DataSource -> Create another class that supports loading patient data from JSON files, with the +> Create another class that supports loading catchment data from JSON files, with the > appropriate `load_catchment_data()` method. > There is a function in `models.py` that loads from JSON in the following format: > ```json