Edgecase | An approach to caching

Goal

To describe an approach to data management in a client application.

Contents

- Goal
- Contents
- Abstract
- Introduction
- Sources of Truth
- Everything is a model
- The query cycle
- Query registration
- The overlapping Query problem
- Improvements
- Conclusion

Abstract

This approach to data caching makes use of a set of records stored by a client application. The structure of these records is common to every API, with adaptors translating between the source data and the cached data. This allows the client to use a common language to express a desire for data to display or to use in a calculation needed for the user interface. Although inspired by systems like Redux, designed for JavaScript, the concept is easy to apply to other situations.

Introduction

My work on more complex client applications has led me to the conclusion that design of the data management process is as important as the design of the user interface, if not more so. It is the question of what data gets loaded, when, and in what order. Once data is loaded, how is it cached and accessed?

My current best approach involves splitting the data loading process into sets of independent records: Query, Resolution, API, Model, and Instance. Access to any API, or a Source of Truth (to be defined later), is mediated by this common interface. Adaptors to each API are the only differentiating pieces in an otherwise uniform caching system.

The solution is currently based on Redux, but I believe the principles are widely applicable. I will first discuss the basic framework and some core principles. Then, I will look at some weaknesses of this approach. Lastly, I will look at some things to try in the future and how systems other than Redux and JavaScript could improve the approach.

Sources of Truth

A Source of Truth is any data store that contains data uniquely. That is, the data it contains cannot be obtained elsewhere. A good example is a database accessed through a REST API. If the database contains a list of users, a client cannot know that list of users before requesting it. If the list has changed since it was last accessed, any update provided to clients is taken as more "true" than the temporarily cached list that they already have. No comparison is made on the client to confirm this. The data from the Source of Truth will simply override any associations that the client has cached.

This description also applies to the value of the URL in the browser, although this might not be immediately obvious. The URL can be used for routing and may contain data such as the ID of a resource the user wishes to access. The URL is the sole source of this information and it cannot be known before it is entered by the user. Thus, the URL is a Source of Truth.

The approach presented here handles all Sources of Truth using the same process.
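To illustrate, here is a minimal sketch of two Sources of Truth feeding one cache through the same process. The adaptor names (restUserSource, urlSource), the record shape, and the example URL are assumptions made for illustration, and the REST adaptor returns a fixed response instead of making a real request.

```javascript
// Hypothetical adaptors: each reads from a different Source of Truth
// but returns records in the same shape.
const restUserSource = {
	// Stand-in for a real HTTP request; returns a fixed response here.
	read: () => [{ id: 'user-1', attributes: { name: 'Bob' } }],
};

const urlSource = {
	// Treats the last path segment of the URL as the requested resource ID.
	read: (href) => {
		const { pathname } = new URL(href);
		const segments = pathname.split('/').filter(Boolean);
		const resourceId = segments[segments.length - 1];
		return [{ id: 'route', attributes: { resourceId } }];
	},
};

// Incoming records replace cached records with the same ID.
// No comparison is made: the Source of Truth always wins.
function applyRecords(cache, records) {
	const next = { ...cache };
	for (const record of records) {
		next[record.id] = record;
	}
	return next;
}

let cache = { 'user-1': { id: 'user-1', attributes: { name: 'Stale name' } } };
cache = applyRecords(cache, restUserSource.read());
cache = applyRecords(cache, urlSource.read('https://example.com/users/42'));
```

After both calls, the stale user record has been overwritten and the route record holds the ID taken from the URL.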

Everything is a model

An aspect of this approach is the idea that every piece of data should be expressible as an instance of a model. A model has a set of properties, such as a name and a unique id. The result of any query should be a set of model instances. This includes what would appear to be isolated calls to /some_endpoint_that_returns_a_result. Ideally, an API should be designed with this in mind, and only allow access to models according to a filter, functions called on model instances, and functions that return model instances.

If an API does not conform to this, the client adaptors written for each API will force communication into this pattern to ensure uniformity in the client. An important part of this is the establishment of a unique, persistent ID for each piece of data. If a piece of data is not delivered with a unique ID, it will be generated by combining properties such as name, type, or creation date such that a piece of data can be disambiguated when accessed repeatedly.

An example of how data may be transformed to allow unique identification:

	import sha256 from '...';

	class User {
		// Derive a stable local ID from properties that together identify a user.
		generateLocalId({ name, dateOfBirth }) {
			const digestibleString = `${name}:${dateOfBirth}`;

			return sha256(digestibleString);
		}

		// Wrap the incoming data in a record keyed by its local ID.
		transformForStorage(data) {
			const { attributes, relationships } = data;
			const localId = this.generateLocalId(attributes);

			return {
				[localId]: {
					id: localId,
					attributes,
					relationships,
				},
			};
		}
	}

The query cycle

To access data, the requester must start with a Query. This contains a standardised description of what the requester wishes to access. This is very similar to the concept of an SQL query. By passing the request through this standardised format, adaptors can translate it into the format expected by a particular API.

An example of a query:

	{
		"queryId": "[uuid]",
		"model": "User",
		"filter": {
			"and": [
				{ "name__icontains": "b" },
				{ "name__icontains": "anything" }
			]
		},
		"sort": [
			"-name"
		],
		"page": 0,
		"size": 10,
		"create": false,
		"remove": false
	}
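As a rough sketch of what an adaptor might do with such a Query, the function below translates it into a request path for a REST API. The function name toRestRequest, the parameter names ordering and page_size, and the endpoint naming convention are all assumptions for illustration. Only a top-level "and" of simple conditions is handled.

```javascript
// Hypothetical adaptor for a REST API that accepts flat query-string filters.
function toRestRequest(query) {
	const params = new URLSearchParams();

	// This sketch only supports a top-level "and" of simple conditions.
	const conditions = (query.filter && query.filter.and) || [];
	for (const condition of conditions) {
		for (const [key, value] of Object.entries(condition)) {
			params.append(key, String(value));
		}
	}

	if (query.sort && query.sort.length > 0) {
		params.set('ordering', query.sort.join(','));
	}
	params.set('page', String(query.page));
	params.set('page_size', String(query.size));

	// Assumed convention: the endpoint is the lower-cased, pluralised model name.
	return `/${query.model.toLowerCase()}s?${params.toString()}`;
}

const request = toRestRequest({
	model: 'User',
	filter: { and: [{ name__icontains: 'b' }, { name__icontains: 'anything' }] },
	sort: ['-name'],
	page: 0,
	size: 10,
});
```

A second adaptor for a different API would accept the same Query and produce a different request, which is the only place the two APIs differ from the point of view of the client.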

An example of code used to generate a query using a React Hook, User.objects.useFilter:

	const api = new UserAPI();
	const { User } = api.models;

	const users = User.objects.useFilter({
		filter: (fields, { and }) => and(
			fields.NAME.icontains('b'),
			fields.NAME.icontains('anything'),
		),
		sort: fields => [
			fields.NAME.descending,
		],
	});

A Query is generated by using a Model, such as User, which in turn is uniquely associated with an API. This chain of generation tells the system everything it needs to know in order to contact the correct API. Further, the parameters of the Query are compressed to generate a unique query ID. Generating a Query with the same parameters will generate the same ID and immediately yield any matching data that is already cached.
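One way to obtain the same ID from the same parameters is to serialise them in a canonical form. The sketch below is an assumption about how this could work, not the method used in the system described here: the helper names canonicalise and generateQueryId are hypothetical, and the canonical string is used directly as the ID where a real system might hash it.

```javascript
// Serialises a value with object keys in sorted order, so that two objects
// with the same contents always produce the same string.
function canonicalise(value) {
	if (Array.isArray(value)) {
		return `[${value.map(canonicalise).join(',')}]`;
	}
	if (value !== null && typeof value === 'object') {
		const entries = Object.keys(value)
			.sort()
			.map((key) => `${JSON.stringify(key)}:${canonicalise(value[key])}`);
		return `{${entries.join(',')}}`;
	}
	return JSON.stringify(value);
}

// Hypothetical helper: the canonical string itself serves as the ID here.
function generateQueryId(params) {
	return canonicalise(params);
}

const cachedResolutions = {};
const firstId = generateQueryId({ model: 'User', page: 0, size: 10 });
cachedResolutions[firstId] = ['user-1', 'user-2'];

// The same parameters in a different key order yield the same ID,
// so the cached result is found immediately.
const secondId = generateQueryId({ size: 10, page: 0, model: 'User' });
const cachedHit = cachedResolutions[secondId];
```

Array order is preserved on purpose, since the order of sort terms changes the meaning of a Query.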

Queries are either resolved or not. Data returned from an API can resolve a Query. A Redux Saga watches the set of unresolved Queries and makes requests to the appropriate API. When a response is received, the Query is updated to mark it as resolved, and a Resolution record is generated containing the ordered list of data IDs to display. A set of unordered Instance records is also created.

A set of linked records:

	{
		"query": {
			"[queryId]": {
				"attributes": {
					"resolved": true
				},
				"relationships": {
					"model": "[modelId]",
					"api": "[apiId]"
				}
			}
		},
		"resolution": {
			"[queryId]": {
				"relationships": {
					"instance": [
						"[instanceId]"
					]
				}
			}
		},
		"instance": {
			"[instanceId]": {
				"relationships": {
					"model": "[modelId]"
				}
			}
		},
		"model": {
			"[modelId]": {
				"relationships": {
					"api": "[apiId]"
				}
			}
		},
		"api": {
			"[apiId]": {

			}
		}
	}
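The step that produces these records from an API response might look like the following sketch. It is written as a plain function in the style of a Redux reducer. The function name resolveQuery and the shape of the response items are assumptions, and the saga that would call it is omitted.

```javascript
// Hypothetical reducer-style function: given the current store and an API
// response, returns a new store with the Query marked as resolved, a
// Resolution holding the ordered IDs, and the Instances keyed by ID.
function resolveQuery(store, queryId, responseItems) {
	const query = store.query[queryId];
	const instances = { ...store.instance };
	const orderedIds = [];

	for (const item of responseItems) {
		orderedIds.push(item.id);
		instances[item.id] = {
			attributes: item.attributes,
			relationships: { model: query.relationships.model },
		};
	}

	return {
		...store,
		query: {
			...store.query,
			[queryId]: { ...query, attributes: { ...query.attributes, resolved: true } },
		},
		resolution: {
			...store.resolution,
			[queryId]: { relationships: { instance: orderedIds } },
		},
		instance: instances,
	};
}

const initialStore = {
	query: {
		q1: { attributes: { resolved: false }, relationships: { model: 'User', api: 'users-api' } },
	},
	resolution: {},
	instance: {},
};
const resolvedStore = resolveQuery(initialStore, 'q1', [
	{ id: 'u2', attributes: { name: 'Bob' } },
	{ id: 'u1', attributes: { name: 'Abe' } },
]);
```

The order of the response is kept only in the Resolution. The Instance records carry no ordering of their own.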

The combination of the ordered list of IDs from the Resolution and the unordered data in the form of Instances is the data that will ultimately be delivered to the requester. A potentially controversial consequence of this is the fact that Instance records cached in the local store are not indexed by Model type, but only by ID. This is because the store is not a Source of Truth and will never be queried directly. The store in the client is an aggregator for many Sources of Truth used to cache query results.

Because of this, it is difficult to answer a question such as "which Instances of type User are currently cached?". This can be done inefficiently by filtering Instance records, but it is better to optimise the system to answer queries from the Sources of Truth rather than the temporary cache.
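The difference between the two kinds of access can be sketched with a pair of selectors. The store shape follows the linked records shown earlier; the selector names and the sample values are made up for illustration.

```javascript
const store = {
	resolution: { q1: { relationships: { instance: ['u2', 'u1'] } } },
	instance: {
		u1: { attributes: { name: 'Abe' }, relationships: { model: 'User' } },
		u2: { attributes: { name: 'Bob' }, relationships: { model: 'User' } },
		t1: { attributes: { name: 'Admins' }, relationships: { model: 'Team' } },
	},
};

// Fast path: follow the Resolution's ordered IDs into the Instance records.
function selectQueryResult(state, queryId) {
	const resolution = state.resolution[queryId];
	if (!resolution) return [];
	return resolution.relationships.instance.map((id) => state.instance[id]);
}

// Slow path: with no index by Model, every cached Instance must be scanned.
function selectInstancesOfModel(state, modelId) {
	return Object.values(state.instance).filter(
		(instance) => instance.relationships.model === modelId,
	);
}

const names = selectQueryResult(store, 'q1').map((instance) => instance.attributes.name);
const cachedUsers = selectInstancesOfModel(store, 'User');
```

The first selector does work proportional to the size of the result, while the second does work proportional to the size of the whole cache.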

Query registration

Rather than simply fetching and returning data that matches each query in the moment that it is requested, the server running the API can be configured to respond in a much more powerful way to the Query system. It can store a record of Queries that have been submitted by each client. When data is updated by another client or a server process, it can be compared against registered client Queries and changed data can be pushed immediately to the appropriate clients. In this way, a form of "live database" can be implemented.

Queries can also be unregistered when they are no longer needed, letting the server know what needs to be fulfilled and what does not. Queries can be stored on a per-session basis and discarded when the session is complete, or serve as a more permanent record of the data accessed by each client.
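A server-side registry of this kind could be sketched as follows. The class QueryRegistry, its method names, and the use of a predicate function to stand in for the Query filter are all assumptions. A real server would persist the registrations and send the pushes over a connection such as a WebSocket instead of returning them.

```javascript
// Hypothetical in-memory registry of the Queries submitted by each client.
class QueryRegistry {
	constructor() {
		// clientId -> Map of queryId -> { model, matches }
		this.clients = new Map();
	}

	register(clientId, queryId, model, matches) {
		if (!this.clients.has(clientId)) this.clients.set(clientId, new Map());
		this.clients.get(clientId).set(queryId, { model, matches });
	}

	unregister(clientId, queryId) {
		const queries = this.clients.get(clientId);
		if (queries) queries.delete(queryId);
	}

	// Called when a record changes. Returns the pushes that should be sent,
	// one per registered Query whose predicate matches the changed record.
	notify(model, record) {
		const pushes = [];
		for (const [clientId, queries] of this.clients) {
			for (const [queryId, query] of queries) {
				if (query.model === model && query.matches(record)) {
					pushes.push({ clientId, queryId, record });
				}
			}
		}
		return pushes;
	}
}

const registry = new QueryRegistry();
const nameContainsB = (user) => user.name.toLowerCase().includes('b');
registry.register('client-a', 'q1', 'User', nameContainsB);
registry.register('client-b', 'q2', 'User', () => true);

const firstPushes = registry.notify('User', { id: 'u1', name: 'Bob' });
registry.unregister('client-b', 'q2');
const laterPushes = registry.notify('User', { id: 'u3', name: 'Bea' });
```

Once client-b has unregistered its Query, later changes are pushed only to client-a.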

The overlapping Query problem

If a frontend client, which often has limited resources, is given the responsibility of managing queries efficiently, it will be necessary to ensure that expensive work is not repeated. This leads to the same optimisation problems found in database query systems. For example, it would be efficient to be able to calculate that the query ALL USERS logically contains the query USERS FILTERED BY name contains "string". This would avoid registering the second query and the repeated work when filtering. However, this optimisation requires calculations that could be equally expensive.

There are some shortcuts that avoid both optimisation and repeated work. If a single resource ID is detected in the query, the cache can be checked before the query is registered with the server, and the cached resource can be returned immediately. Filter statements can also be checked for exact matches to see if two queries differ only by an added term, but this type of calculation should be strictly limited so that it does not consume time better spent improving the user experience.
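Both shortcuts can be sketched in a few lines. The function names, the use of an id key in the filter to mark a single-resource Query, and the restriction to a flat "and" list are assumptions made to keep the example small.

```javascript
// Shortcut 1: a Query for a single resource ID can be answered from the cache.
// The `id` filter key is an assumed convention for this sketch.
function answerFromCache(query, instanceCache) {
	const id = query.filter && query.filter.id;
	if (id !== undefined && instanceCache[id]) {
		return instanceCache[id];
	}
	return null; // Not answerable locally; register with the server instead.
}

// Shortcut 2: detect that `narrow` has every "and" term of `broad`, plus
// possibly more, by exact match on the serialised terms. Anything subtler
// than an exact match is deliberately not attempted.
function isNarrowingOf(narrow, broad) {
	if (narrow.model !== broad.model) return false;
	const narrowTerms = new Set(narrow.filter.and.map((term) => JSON.stringify(term)));
	return broad.filter.and.every((term) => narrowTerms.has(JSON.stringify(term)));
}

const allUsers = { model: 'User', filter: { and: [] } };
const usersWithString = {
	model: 'User',
	filter: { and: [{ name__contains: 'string' }] },
};

const narrower = isNarrowingOf(usersWithString, allUsers);
const broader = isNarrowingOf(allUsers, usersWithString);
```

Here narrower is true and broader is false. Knowing that one Query narrows another only helps if the broader result is fully cached, which paging can prevent.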

The fact that queries can easily overlap is a weakness of this approach, and it is theoretically possible to construct two strongly overlapping queries that cannot be simplified. These inefficient edges should be avoided by responsible limitation of the query parameters, such as page size.

Improvements

Redux solves a problem that is particularly acute in JavaScript: there is no cheap, built-in way to detect that an object stored in memory has changed. The Redux solution is to treat objects as immutable and duplicate an object whenever its data changes, so that the change is represented by the difference in reference between the original and the duplicate.

A language without this limitation does not need to resort to this method, and the pieces of this approach that exist only to handle the limitation can be removed. Despite this, the approach presented here, which originally grew out of the pattern that Redux encourages, can be used elsewhere without any conceptual changes. Buffers representing a linked set of records in the form of Queries, Resolutions, etc. can still be used to track incoming data from different Sources of Truth.
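The reference comparison that Redux relies on can be seen in a small example using plain objects. The state shape is made up for illustration and no Redux API is involved.

```javascript
const original = { users: { u1: { name: 'Bob' } }, teams: {} };

// Mutating in place: the reference is unchanged, so a reference comparison
// cannot tell that anything happened.
const mutated = original;
mutated.teams.t1 = { name: 'Admins' };
const mutationDetected = mutated !== original;

// Immutable update: copy the path that changes and reuse the rest.
const updated = {
	...original,
	users: { ...original.users, u1: { ...original.users.u1, name: 'Robert' } },
};
const updateDetected = updated !== original;
const untouchedBranchReused = updated.teams === original.teams;
```

Only the immutable update is visible to a reference check, and the branch that did not change is shared between the two objects without being copied.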

In the future, I hope to be able to use this approach in a non-browser context using another, more powerful language to run a client application.

Conclusion

In conclusion, this approach to client data caching offers some advantages when organising the loading of large amounts of data from a set of APIs. Firstly, it provides a framework for classifying the data on the client so it can be easily accessed regardless of its original format. Secondly, an API can be designed to easily track the queries made by a client and intelligently and asynchronously update the client with new or modified data. Lastly, despite its inspiration in the patterns encouraged by Redux as a response to JavaScript's difficulty in detecting changes to objects, the approach is easily replicable in other languages for clients other than a browser.