What has set Amazon apart from the competition in online retail? Their supply chain. In fact, this has long been one of the greatest strengths of one of their chief competitors, Walmart.

Supply chains are highly complex systems consisting of hundreds if not thousands of manufacturers and logistics carriers around the world who combine resources to create the products we use and consume every day. To trackÂ *all*Â of the inputs to aÂ single, simple product would be staggering. Yet supply chain organizations inside vertically integrated corporations are tasked with managing inputs from raw materials, to manufacturing, warehousing, and distribution to customers. The companies that do this best cut down on waste from excess storage, to unneeded transportation costs, and lost time to get products and materials to later stages in the system. Optimizing these systems is a key component in businesses as dissimilar as Apple and Saudi Aramco.

A lot of time and effort has been put into building effective supply chain optimization models, but due to their size and complexity, they can be difficult to build and manage. With advances in machine learning, particularly reinforcement learning, we can train a machine learning model to make these decisions for us, and in many cases, do so better than traditional approaches!

## TL;DR

We train a deep reinforcement learning model usingÂ RayÂ andÂ `or-gym`

Â to optimize a multi-echelon inventory management model andÂ benchmark itÂ against a derivative free optimization model using Powellâ€™s Method.

## Multi-Echelon Supply Chain

In our example, weâ€™re going to work with a **multi-echelon** supply chain model with lead times. This means that we have different stages of our supply chain that we need to make decisions for, and each decision that we make at different levels are going to affect decisions downstream. In our case, we have *M* stages going back to the producer of our raw materials all the way to our customers. Each stage along the way has a different **lead time**, or time it takes for the output of one stage to arrive and become the input for the next stage in the chain. This may be 5 days, 10 days, whatever. The longer these lead times become, the earlier you need to anticipate customer orders and demand to ensure you donâ€™t stock out or lose sales!

**If this in-depth educational content on is useful for you, you canÂ subscribe to our AI research mailing listÂ to be alerted when we release new material.Â **

## Inventory Management with OR-Gym

The OR-Gym library has a few multi-echelon supply chain models ready to go to simulate this structure. For this, weâ€™ll use the `InvManagement-v1`

environment, which has the structure shown above, but results in lost sales if you don’t have sufficient inventory to meet customer demand.

If you havenâ€™t already, go ahead and install the package with:

pip install or-gym

Once installed, we can set up our environment with:

env = or_gym.make('InvManagement-v1')

This is a four-echelon supply chain by default. The actions determine how much material to order from the echelon above at each time step. The orders quantities are limited by the capacity of the supplier and their current inventory. So, if you order 150 widgets from a supplier that has a shipment capacity of 100 widgets and only has 90 widgets on hand, youâ€™re going to only get 90 sent.

Each echelon has its own costs structure, pricing, and lead times. The last echelon (Stage 3 in this case) provides raw materials, and we donâ€™t have any inventory constraints on this stage, assuming that the mine, oil well, forest â€” or whatever produces your raw material inputs â€” is large enough that this isnâ€™t a constraint we need to concern ourselves with.

As with all `or-gym`

environments, if these settings don’t suit you, simply pass an environment configuration dictionary to the `make`

function to customize your supply chain accordingly (an example is given here).

## Training with Ray

To train your environment, weâ€™re going to leverage theÂ Ray libraryÂ to speed up our training, so go ahead and import your packages.

import or_gym from or_gym.utils import create_env import ray from ray.rllib import agents from ray import tune

To get started, weâ€™re going to need a brief registration function to ensure that Ray knows about the environment we want to run. We can register that with the `register_env`

function shown below.

def register_env(env_name, env_config={}): env = create_env(env_name) tune.register_env(env_name, lambda env_name: env(env_name, env_config=env_config))

From here, we can set up our RL configuration and everything we need to train the model.

# Environment and RL Configuration Settings env_name = 'InvManagement-v1' env_config = {} # Change environment parameters here rl_config = dict( env=env_name, num_workers=2, env_config=env_config, model=dict( vf_share_layers=False, fcnet_activation='elu', fcnet_hiddens=[256, 256] ), lr=1e-5 ) # Register environment register_env(env_name, env_config)

TheÂ `rl_config`

Â dictionary is where you can set all of the relevant hyperparameters or set your system to run on a GPU. Here, we’re just going to use 2 workers for parallelization, and train a two-layer network with an ELU activation function. Additionally, if you’re going to useÂ `tune`

Â for hyperparameter tuning, then you can use tools likeÂ `tune.gridsearch()`

Â to systematically update learning rates, change the network, or whatever you like.

Once your happy with that, go head and choose your algorithm and get to training! Below, I just use the PPO algorithm because I find it trains well on most environments.

# Initialize Ray and Build Agent ray.init(ignore_reinit_error=True) agent = agents.ppo.PPOTrainer(env=env_name, config=rl_config) results = [] for i in range(500): res = agent.train() results.append(res) if (i+1) % 5 == 0: print('\rIter: {}\tReward: {:.2f}'.format( i+1, res['episode_reward_mean']), end='') ray.shutdown()

The code above will initialize `ray`

, then build the agent according to the configuration you specified previously. If you’re happy with that, then let it run for a bit and see how it does!

One thing to note with this environment: if the learning rate is too high, the policy function will begin to diverge such that the loss becomes astronomically large. At that point, youâ€™ll wind up getting an error, typically stemming from Rayâ€™s default pre-processor with state showing bizarre values because the actions being given by the network are all `nan`

. This is easy to fix by bringing the learning rate down a bit and trying again.

Letâ€™s take a look at the performance.

import numpy as np import matplotlib.pyplot as plt from matplotlib import gridspec # Unpack values from each iteration rewards = np.hstack([i['hist_stats']['episode_reward'] for i in results]) pol_loss = [ i['info']['learner']['default_policy']['policy_loss'] for i in results] vf_loss = [ i['info']['learner']['default_policy']['vf_loss'] for i in results] p = 100 mean_rewards = np.array([np.mean(rewards[i-p:i+1]) if i >= p else np.mean(rewards[:i+1]) for i, _ in enumerate(rewards)]) std_rewards = np.array([np.std(rewards[i-p:i+1]) if i >= p else np.std(rewards[:i+1]) for i, _ in enumerate(rewards)]) fig = plt.figure(constrained_layout=True, figsize=(20, 10)) gs = fig.add_gridspec(2, 4) ax0 = fig.add_subplot(gs[:, :-2]) ax0.fill_between(np.arange(len(mean_rewards)), mean_rewards - std_rewards, mean_rewards + std_rewards, label='Standard Deviation', alpha=0.3) ax0.plot(mean_rewards, label='Mean Rewards') ax0.set_ylabel('Rewards') ax0.set_xlabel('Episode') ax0.set_title('Training Rewards') ax0.legend() ax1 = fig.add_subplot(gs[0, 2:]) ax1.plot(pol_loss) ax1.set_ylabel('Loss') ax1.set_xlabel('Iteration') ax1.set_title('Policy Loss') ax2 = fig.add_subplot(gs[1, 2:]) ax2.plot(vf_loss) ax2.set_ylabel('Loss') ax2.set_xlabel('Iteration') ax2.set_title('Value Function Loss') plt.show()

It looks like our agent learned a decent policy!

One of the difficulties of deep reinforcement learning for these classic, operations research problems is the lack of optimality guarantees. In other words, we can look at that training curve above and see that it is learning a better and better policy â€” and it seems to be converging on a policy â€” but we donâ€™t know *how* good that policy is. Could we do better? Should we invest more time (and money) into hyperparameter tuning? To answer this, we need to turn to some different methods and develop a benchmark.

## Derivative Free Optimization

A good way to benchmark an RL model is with **derivative free optimization** (DFO). Like RL, DFO treats the system as a black-box model providing inputs and getting some feedback in return to try again as it seeks the optimal value.

Unlike RL, DFO has no concept of a state. This means that we will try to find a fixed re-order policy to bring inventory up to a certain level to balance holding costs and profit from sales. For example, if the policy at stage 0 is to re-order up to 10 widgets, and the currently, we have 4 widgets, then the policy states weâ€™re going to re-order 6. In the RL case, it would take into account the current pipeline and all of the other information that we provide into the state. So RL is more adaptive and ought to outperform a straightforward DFO implementation. If it doesnâ€™t, then we know we need to go back to the drawing board.

While it may sound simplistic, this fixed re-order policy isnâ€™t unusual in industrial applications, partly because real supply chains consist of many more variables and interrelated decisions than weâ€™re modeling here. So a fixed policy is tractable and something that supply chain professionals can easily work with.

### Implementing DFO

There are a lot of different algorithms and solvers out there for DFO. For our purposes, weâ€™re going to leverage Scipyâ€™s `optimize`

library to implement Powell’s Method. We won’t get into the details here, but this is a way to quickly find minima on functions and can be used for discrete optimization – like we have here.

from scipy.optimize import minimize

Because weâ€™re going to be working with a fixed re-order policy, we need a quick function to translate inventory levels into actions to evaluate.

def base_stock_policy(policy, env): ''' Implements a re-order up-to policy. This means that for each node in the network, if the inventory at that node falls below the level denoted by the policy, we will re-order inventory to bring it to the policy level. For example, policy at a node is 10, current inventory is 5: the action is to order 5 units. ''' assert len(policy) == len(env.init_inv), ( 'Policy should match number of nodes in network' + '({}, {}).'.format( len(policy), len(env.init_inv))) # Get echelon inventory levels if env.period == 0: inv_ech = np.cumsum(env.I[env.period] + env.T[env.period]) else: inv_ech = np.cumsum(env.I[env.period] + env.T[env.period] - env.B[env.period-1, :-1]) # Get unconstrained actions unc_actions = policy - inv_ech unc_actions = np.where(unc_actions>0, unc_actions, 0) # Ensure that actions can be fulfilled by checking # constraints inv_const = np.hstack([env.I[env.period, 1:], np.Inf]) actions = np.minimum(env.c, np.minimum(unc_actions, inv_const)) return actions

The `base_stock_policy`

function takes the policy levels we supply and calculates the difference between the level and the inventory as described above. One thing to note, when we calculate the inventory level, we include all of the inventory in transit to the stage as well (given in `env.T`

). For example, if the current inventory on hand for stage 0 is 100, and there is a lead time of 5 days between stage 0 and stage 1, then we take all of those orders for the past 5 days into account as well. So, if stage 0 ordered 10 units each day, then the inventory at this echelon would be 150. This makes policy levels greater than capacity meaningful because we’re looking at more than just the inventory in our warehouse today, but looking at everything in transit too.

Our DFO method needs to make function evaluation calls to see how the selected variables perform. In our case, we have an environment to evaluate, so we need a function that will run an episode of our environment and return the appropriate results.

def dfo_func(policy, env, *args): ''' Runs an episode based on current base-stock model settings. This allows us to use our environment for the DFO optimizer. ''' env.reset() # Ensure env is fresh rewards = [] done = False while not done: action = base_stock_policy(policy, env) state, reward, done, _ = env.step(action) rewards.append(reward) if done: break rewards = np.array(rewards) prob = env.demand_dist.pmf(env.D, **env.dist_param) # Return negative of expected profit return -1 / env.num_periods * np.sum(prob * rewards)

Rather than return the sum of the rewards, weâ€™re returning the negative expectation of our rewards. The reason for the negative is the Scipy function weâ€™re using seeks to minimize whereas our environment is designed to maximize the reward, so we invert this to ensure everything is pointing in the right direction. We calculate the expected rewards by multiplying by the probability of our demand based on the distribution. We could take more samples to estimate the distribution and calculate our expectation that way (and for many real-world applications, this would be required), but here, we have access to the true distribution so we can use that to reduce our computational burden.

Finally, weâ€™re ready to optimize.

The following function will build an environment based on your configuration settings, take our `dfo_func`

to evaluate, and apply Powell’s Method to the problem. It will return our policy and ensure that our answer contains only positive integers (e.g. we can’t order half a widget or a negative number of widgets).

def optimize_inventory_policy(env_name, fun, init_policy=None, env_config={}, method='Powell'): env = or_gym.make(env_name, env_config=env_config) if init_policy is None: init_policy = np.ones(env.num_stages-1) # Optimize policy out = minimize(fun=fun, x0=init_policy, args=env, method=method) policy = out.x.copy() # Policy must be positive integer policy = np.round(np.maximum(policy, 0), 0).astype(int) return policy, out

Now itâ€™s time to put it all together.

policy, out = optimize_inventory_policy('InvManagement-v1', dfo_func) print("Re-order levels: {}".format(policy)) print("DFO Info:\n{}".format(out))Re-order levels: [540 216 81] DFO Info: direc: array([[ 0. , 0. , 1. ], [ 0. , 1. , 0. ], [206.39353826, 81.74560612, 28.78995703]]) fun: -0.9450780368543933 message: 'Optimization terminated successfully.' nfev: 212 nit: 5 status: 0 success: True x: array([539.7995151 , 216.38046861, 80.66902905])

Our DFO model found a fixed-stock policy with re-order levels at 540 for stage 0, 216 for stage 1, and 81 for stage 2. It did this with only 212 function evaluations, i.e. it simulated 212 episodes to find the optimal value.

We can run then feed this policy into our environment, say 1,000 times, to generate some statistics and compare it to our RL solution.

env = or_gym.make(env_name, env_config=env_config) eps = 1000 rewards = [] for i in range(eps): env.reset() reward = 0 while True: action = base_stock_policy(policy, eenv) s, r, done, _ = env.step(action) reward += r if done: rewards.append(reward) break

## Comparing Performance

Before we get into the reward comparisons, note that these are not perfect, 1:1 comparisons. As mentioned before, DFO yields us a fixed policy whereas RL has a more flexible, dynamic policy that changes based on state information. Our DFO approach was also given some information in terms of probabilities of demand to calculate the expectation on, RL had to infer that from additional sampling. So while RL learned from nearly ~65k episodes and DFO only had to make 212 function calls, they arenâ€™t exactly comparable. Considering that to enumerate every meaningful fixed policy once would require ~200 million episodes, then RL doesnâ€™t look so sample inefficient given its task.

So, how do these stack up?

What we can see above is that RL does indeed outperform our DFO policy by 11% on average (460 to 414). The RL model overtook the DFO policy after ~15k episodes and improved steadily after that. There is some higher variance with the RL policy however, with a few terrible episodes thrown in to the mix. All things considered, we did get stronger results overall from the RL approach, as expected.

In this case, neither method was very difficult to implement nor computationally intensive. I forgot to change my `rl_config`

settings to run on my GPU and it still only took about 25 minutes to train on my laptop while the DFO model took ~2 seconds to run. More complex models may not be so friendly in either case.

Another thing to note, both methods can be very sensitive to initial conditions and neither are guaranteed to find the optimum policy in every case. If you have a problem youâ€™d like to apply RL to, maybe use a simple DFO solver first, try a few initial conditions to get a feel for the problem, then spin up the full, RL model. You may find that the DFO policy is sufficient for your task.

Hopefully this gave a good overview of how to use these methods and the `or-gym`

library. Leave feedback or questions if you have any!

*This article was originally published on DataHubbs and re-published to TOPBOTS with permission from the author.*

## Enjoy this article? Sign up for more applied AI updates.

Weâ€™ll let you know when we release more technical education.

## Leave a Reply