Data & Analytics

Controlling Crack Propagation Using Reinforcement Learning

This article presents the application of a reinforcement learning control framework based on the Deep Deterministic Policy Gradient algorithm. The crack propagation process is simulated in Abaqus, which is integrated with a reinforcement learning environment to control crack propagation in a brittle material. The real-world deployment of the proposed control framework is also discussed.


Cracks determine the transport, mechanical, and chemical properties of a material. The ability to control crack propagation is therefore crucial for manipulating the spatiotemporal variations of material properties. In the domain of subsurface engineering, controlling subsurface fractures improves the production of natural gas and petroleum, the development of engineered geothermal systems, and the long-term storage of CO2 (Pyrak-Nolte 2015).

Reinforcement learning (RL) is a machine-learning technique developed specifically for learning sequential decision-making and control strategies. Unlike supervised and unsupervised learning, RL learns from experience obtained by dynamically interacting with the environment: taking actions, observing the state of the environment, and receiving rewards based on the action and the resulting state. By continuously interacting with an environment, the RL agent learns a control policy that maximizes the expected cumulative discounted future reward. The Deep Deterministic Policy Gradient (DDPG) algorithm is an off-policy RL algorithm introduced by Lillicrap et al. (2015) that is suitable for control problems with high-dimensional, continuous state and action spaces.

DDPG is a model-free, off-policy algorithm that combines ideas from the deterministic policy gradient method, deep Q-learning, and actor-critic techniques. It aims to maximize the action-value (Q-value) function, which evaluates the performance of the control policy. The Q-value is the expected return (the expected cumulative discounted reward) after taking a specific action in a specific state and thereafter following the policy. By implementing deep neural networks as function approximators in both the actor and critic models, the DDPG algorithm can handle control problems with high-dimensional, continuous state and action spaces, which makes it suitable for the crack-growth control problem investigated in this study.
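As a rough illustration of these two objectives, the minimal sketch below computes the Bellman target used to fit the critic and the quantity the actor tries to maximize. The function names (critic_target, mu_target, q_target, etc.) are placeholders chosen for this example and are not taken from the authors' code.

```python
import numpy as np

def critic_target(rewards, next_states, mu_target, q_target, gamma=0.99):
    """Bellman target y = r + gamma * Q'(s', mu'(s')) used to fit the critic.

    mu_target and q_target stand in for the target actor and target critic;
    this study uses gamma = 0.0 (Table 1) because each episode is a single step.
    """
    next_actions = mu_target(next_states)          # target actor proposes a' for each s'
    return rewards + gamma * q_target(next_states, next_actions)

def actor_objective(states, mu, q):
    """The actor is trained to maximize Q(s, mu(s)); equivalently, to minimize its negative."""
    return -np.mean(q(states, mu(states)))
```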

The DDPG-based RL control framework has been successfully applied by the authors for controlling crack propagation in simple and complex synthetic environments (Jin and Misra 2022, 2023). In this article, the DDPG-based RL control framework is integrated with the Abaqus simulator, which simulates the crack propagation, to demonstrate the application of the RL technique for controlling crack propagation in a brittle material.

Description of the Environment Simulated in Abaqus
Abaqus is used to simulate crack propagation in a thin, rectangular material sample with a width of 13 mm and a height of 25 mm under surface traction, as shown in Fig. 1. The material sample has a Young’s modulus of 50.0 GPa and a Poisson’s ratio of 0.3. The material is assigned a maximum principal stress of 30.0 MPa, a maximum displacement of 0.05, and a cohesive coefficient of 1×10−5. The left edge of the material is fixed with an encastre boundary condition, and surface tractions are applied at the top and bottom edges. An initial crack with a length of 3.0 mm is located at the middle of the left edge.

Crack_Fig1.JPG
Fig. 1—The training environment created in the Abaqus simulator to learn the control policy for controlling crack propagation under surface traction. The thin, rectangular material sample has a width of 13 mm and height of 25 mm. The top and bottom edges of the material are under surface tractions. The crack propagates from initial crack on the left edge to a randomly assigned goal point on the right edge. The RL agent learns to control the magnitudes and directions of surface tractions to learn the desired control policy.
Source: Yuteng Jin and Sid Misra

An RL agent has to control the directions and magnitudes of the surface tractions on the top and bottom edges such that the crack propagates from the initial crack on the left edge to a predetermined goal point on the right edge (Fig. 1). For purposes of training the RL agent, the goal point is randomly placed on the right edge for each training episode, giving the learning agent more opportunity to explore the action and solution spaces and develop a robust control strategy to accomplish the desired control task.

The state of the environment is defined in terms of the location of the crack tip (xtip, ytip), which propagates from the left edge, and the location of the goal point (xgoal, ygoal) on the right edge. The state space is constrained by the area of the 2D material. The RL agent learns to reach the goal point by controlling the directions, θtop and θbottom, and the magnitudes, σtop and σbottom, of the surface tractions on the top and bottom edges. The action space of θtop and θbottom is constrained to [−45°, 45°], and that of σtop and σbottom is constrained to [200.0, 1000.0] MPa.
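For concreteness, the sketch below represents these state and action bounds as arrays; the variable names are illustrative and not taken from the authors' implementation.

```python
import numpy as np

# State: crack-tip and goal-point coordinates inside the 13 mm x 25 mm sample.
state_low  = np.array([0.0, 0.0, 0.0, 0.0])       # [x_tip, y_tip, x_goal, y_goal], mm
state_high = np.array([13.0, 25.0, 13.0, 25.0])   # mm

# Action: traction directions (deg) and magnitudes (MPa) on the top and bottom edges.
action_low  = np.array([-45.0, -45.0, 200.0, 200.0])   # [theta_top, theta_bot, sigma_top, sigma_bot]
action_high = np.array([45.0, 45.0, 1000.0, 1000.0])

def clip_action(raw_action):
    """Keep a proposed action inside the allowed ranges."""
    return np.clip(raw_action, action_low, action_high)
```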

Interactions of the DDPG-Based RL Framework With the Abaqus-Based Environment
Fig. 2 shows the integration of the DDPG-based RL control framework with the Abaqus environment. The RL agent has to learn to control the crack propagation from the initial crack to the goal point by modifying the surface tractions, and it learns the control policy by interacting with the Abaqus-based environment. An Abaqus script file defines the state of the environment and is modified according to the action (load) generated by the RL agent. The script builds an Abaqus model based on the latest state and the newly generated boundary condition (load) and submits the simulation job to Abaqus. Once the job is done, the script extracts the crack information (the PHILSM field) from the Abaqus output database to create a result file, which captures the crack propagation produced by the RL agent’s action and the prior state of the environment. Based on the result file, a Python script calculates the reward assigned to the RL agent’s action and the latest state of the environment, which is then used to build a new Abaqus model for the next update of the control policy.

Crack_Fig2.JPG
Fig. 2—Integration of the DDPG-based RL control framework with the Abaqus environment. The work flow shows the interactions between the RL agent and environment to generate transitions into new state based on the prior state. Reward is assigned based on the usefulness of the new state toward the desired control task.
Source: Yuteng Jin and Sid Misra
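One way this loop could look in code is sketched below. The file names (run_model.py, crack_result.txt, action_input.txt) and the text-file hand-off are assumptions made for this example; the actual workflow in Fig. 2 edits the Abaqus script directly and extracts PHILSM from the output database, but the overall loop has the same shape.

```python
import subprocess
import numpy as np

def run_abaqus_step(state, action, script="run_model.py", result_file="crack_result.txt"):
    """One environment step: pass the action to Abaqus, run the job, read the new state."""
    # 1. Write the current state and the agent's action (surface tractions) for the script.
    np.savetxt("action_input.txt", np.concatenate([state, action]))

    # 2. Run the Abaqus/CAE script without the GUI; it builds the model from the
    #    latest state, applies the new loads, submits the job, and post-processes
    #    PHILSM into result_file.
    subprocess.run(["abaqus", "cae", f"noGUI={script}"], check=True)

    # 3. Read back the crack information produced by the script.
    new_state = np.loadtxt(result_file)
    return new_state
```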

For reinforcement learning to succeed, the reward function must compute a reward of suitable magnitude and behavior based on the latest action taken and the current/subsequent state of the environment. A well-designed reward function provides the information the neural networks in the RL scheme need to learn quickly while remaining stable (i.e., balancing the exploitation and exploration tradeoff). A reward function that gives only negative rewards encourages the agent to reach the goal quickly, but it may lead to early termination to avoid accumulating large penalties. A reward function that gives only positive rewards encourages the agent to keep going to accumulate reward, but it may cause the agent to move slowly toward the target.
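As a simple illustration of the distance-based penalty described later for Fig. 5, the sketch below returns a negative reward equal to the miss distance between the final crack tip and the goal point; any additional scaling or shaping would be a design choice not specified here.

```python
import numpy as np

def reward(crack_tip, goal_point):
    """Negative Euclidean distance between the final crack tip and the goal point.

    A perfect hit gives the maximum possible reward of 0; larger misses give
    larger penalties.
    """
    return -float(np.linalg.norm(np.asarray(crack_tip) - np.asarray(goal_point)))
```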

RL Agent
The RL agent is represented by neural networks. Four deep neural networks are used as the learning agents in the reinforcement learning framework: the actor network μ, the target actor network μʹ, the function approximator for the Q-value function, referred to as the critic network Q, and the target critic network Qʹ. In this study, the neural networks were built with Keras. The target critic/actor networks have the same architecture and the same initial weights as the critic/actor networks. The actor network takes the state as input and deterministically computes a specific action, while the critic network takes both the state and the action as inputs and computes a scalar Q-value (Jin and Misra 2022, 2023).
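A minimal Keras sketch of such actor and critic networks follows; the layer sizes and activations are illustrative assumptions, not the architecture reported by the authors.

```python
import tensorflow as tf
from tensorflow.keras import layers

STATE_DIM, ACTION_DIM = 4, 4   # (x_tip, y_tip, x_goal, y_goal) and four traction parameters

def build_actor():
    """Actor mu: maps a state to a deterministic action in [-1, 1]^4 (rescaled to the bounds later)."""
    inputs = layers.Input(shape=(STATE_DIM,))
    x = layers.Dense(256, activation="relu")(inputs)
    x = layers.Dense(256, activation="relu")(x)
    outputs = layers.Dense(ACTION_DIM, activation="tanh")(x)
    return tf.keras.Model(inputs, outputs)

def build_critic():
    """Critic Q: maps a (state, action) pair to a scalar Q-value."""
    state_in = layers.Input(shape=(STATE_DIM,))
    action_in = layers.Input(shape=(ACTION_DIM,))
    x = layers.Concatenate()([state_in, action_in])
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(256, activation="relu")(x)
    q_value = layers.Dense(1)(x)
    return tf.keras.Model([state_in, action_in], q_value)

actor, critic = build_actor(), build_critic()
target_actor, target_critic = build_actor(), build_critic()
target_actor.set_weights(actor.get_weights())     # target networks start as exact copies
target_critic.set_weights(critic.get_weights())
```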

The actor network represents the current deterministic policy by mapping the state to a specific action. In this study, the weight updates for the critic and actor networks are performed in TensorFlow with the Adam optimizer. For each time step, both target networks are updated slowly (a “soft” update) at a rate τ, where τ ≪ 1. The soft update keeps the learning stable by making the moving target yi change slowly.
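The soft update itself is a one-liner per weight tensor. Continuing the Keras sketch above, and using the τ = 0.005 from Table 1:

```python
def soft_update(target_model, source_model, tau=0.005):
    """Polyak ('soft') update: theta_target <- tau * theta + (1 - tau) * theta_target."""
    new_weights = [tau * w + (1.0 - tau) * tw
                   for tw, w in zip(target_model.get_weights(), source_model.get_weights())]
    target_model.set_weights(new_weights)

# After each gradient step on the critic and actor:
# soft_update(target_critic, critic, tau=0.005)
# soft_update(target_actor, actor, tau=0.005)
```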

Table 1 summarizes the tuning parameters used during the training stage. The actor/critic learning rates and the target-network update rate are tuned to achieve fast and stable training. The discount factor represents the importance of future rewards; it is set to 0.0 because only one step is taken to reach the final state. The capacity of the replay buffer defines the maximum number of interactions stored; it is set equal to the total number of training episodes so that all previous training experience is utilized. The minibatch size defines the number of interactions used to update the networks at each simulation time step. The control process can easily be scaled to multiple steps. The whole training process takes about 5 hours on an Intel Xeon E5-1650 v3 CPU with 6 cores and 32 GB of RAM.

Parameter                        Value
Actor network learning rate      0.0005
Critic network learning rate     0.01
Target networks update rate τ    0.005
Discount factor γ                0.0
Capacity of replay buffer R      500
Minibatch size N                 64
Total training episodes          500

Table 1—Tuning parameters used during the training stage. The parameters are properly tuned to achieve optimal training performance.
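A sketch of how the Table 1 settings might appear in code is given below; the plain Python deque used as the replay buffer is an illustrative choice rather than the authors' implementation.

```python
import random
from collections import deque

# Settings from Table 1.
ACTOR_LR, CRITIC_LR = 0.0005, 0.01
TAU, GAMMA = 0.005, 0.0
BUFFER_CAPACITY, BATCH_SIZE, TOTAL_EPISODES = 500, 64, 500

# Replay buffer; each entry is a (state, action, reward, next_state) interaction.
replay_buffer = deque(maxlen=BUFFER_CAPACITY)

def sample_minibatch():
    """Draw a random minibatch of stored interactions for one network update."""
    return random.sample(replay_buffer, min(BATCH_SIZE, len(replay_buffer)))
```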

Fig. 3 shows an example Abaqus simulation of crack propagation during the training stage, where a simulated crack growth path from an initial crack in a thin, rectangular material is shown. Based on the Abaqus output PHILSM (a signed distance function that describes the crack surface), a Python code is developed to post-process the Abaqus simulation result. The post-processed result is shown in Fig. 4; the reproduced crack path closely matches the actual Abaqus simulation result. Fig. 5 shows the reward history during the training stage. The reward is measured from the distance between the goal point and the actual crack tip when the crack reaches the right edge of the material. A small negative reward means the distance is small, which indicates good control, whereas a large negative reward means the distance is large, which indicates poor control. The maximum possible reward the RL agent may obtain is 0. Initially, the RL agent did not learn a proper control strategy; after 400 training episodes, it was able to develop a good and stable control strategy. This figure demonstrates the capability of the DDPG-based RL framework to control crack propagation in the Abaqus simulator.

Crack_Fig3.JPG
Fig. 3—Example of an Abaqus simulation of crack propagation during the training stage. The crack propagation is a function of material properties and the surface tractions. One simulated crack growth path from an initial crack in a thin, rectangular material is shown. For each training episode, the reinforcement learning scheme will train the learning agent to control the crack growth from the left edge to reach the randomly selected goal point on the right edge. The initial crack of length l0 is at the center of the left edge. The RL agent needs to learn to control the directions and magnitudes of the surface tractions to control the crack propagation until the goal point on the right edge.
Source: Yuteng Jin and Sid Misra
Crack_Fig4.JPG
Fig. 4—Post-processed Abaqus simulation result. The left plot shows the reproduced crack path, which closely matches the actual Abaqus simulation result. The red line represents the crack path, and the blue dot on the right edge represents the goal point. The right plot shows the action generated by the RL agent for the desired control. The crack reaches the goal point by controlling the directions and magnitudes of the top and bottom surface tractions.
Source: Yuteng Jin and Sid Misra
Crack_Fig5.JPG
Fig. 5—Reward history in the training stage. The reward is measured based on the distance between the goal point and the actual crack tip when crack reaches the right edge of the material. Initially, the RL agent did not learn a proper control strategy; consequently, the rewards are large negative values, which represent negative feedback. After 400 training episodes, the RL agent was able to develop a good and stable control strategy, represented by the small negative-valued reward close to zero.
Source: Yuteng Jin and Sid Misra
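The post-processing idea described above can be sketched as a zero-level-set extraction of the PHILSM field. The snippet below is a simplified, linear-interpolation version of that idea and is not the authors' post-processing code; the inputs (nodal coordinates, nodal PHILSM values, element-edge connectivity) are assumed to have been read from the output database beforehand.

```python
import numpy as np

def crack_points_from_philsm(node_xy, philsm, edges):
    """Approximate the crack path as the zero level set of the PHILSM field.

    node_xy : (N, 2) array of nodal coordinates
    philsm  : (N,) array of signed-distance values at the nodes
    edges   : iterable of (i, j) node-index pairs defining element edges
    """
    points = []
    for i, j in edges:
        fi, fj = philsm[i], philsm[j]
        if fi * fj < 0.0:                # sign change: the crack surface crosses this edge
            t = fi / (fi - fj)           # linear interpolation parameter in [0, 1]
            points.append(node_xy[i] + t * (node_xy[j] - node_xy[i]))
    return np.array(points)
```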

Challenges in Real-World Deployment of RL Agents
In a real-world deployment, continuous monitoring of the state can be achieved by performing continuous computed tomography scans of the material; the crack location can also be inferred using sonic waves. A flowchart of learning from a numerical environment (simulator) followed by evaluation and deployment on real-world materials is presented in Fig. 6.

The RL agent needs to interact with the real-world material by tuning certain engineering parameters that ultimately control the crack propagation/growth. In such cases, the signal/feedback returned to the agent from the environment needs to be properly defined. In addition, the engineering parameters to be tuned need to be properly defined and mapped to the actions generated by the RL agent. To prepare for deployment in a real-world scenario, noise needs to be added during the training stage to simulate the randomness of the environment. The RL agent can handle this randomness because it receives real-time feedback from the environment and can react instantly to maintain stable control.
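One simple way to inject such randomness during training is to perturb the observed state before it is passed to the agent, as sketched below; the 2% relative noise level is an arbitrary illustrative value, not a recommendation from the study.

```python
import numpy as np

def noisy_observation(state, rel_noise=0.02, rng=np.random.default_rng()):
    """Perturb the observed crack-tip and goal locations with Gaussian noise."""
    state = np.asarray(state, dtype=float)
    # Scale the noise with the magnitude of each state component (plus a small floor).
    return state + rng.normal(scale=rel_noise * np.abs(state) + 1e-6, size=state.shape)
```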

Crack_Fig6.JPG
Fig. 6—A flowchart of learning from a numerical environment (simulator) followed by evaluation and deployment on a real-world material.
Source: Yuteng Jin and Sid Misra

The real-world challenges to be encountered during the deployment of an RL agent are the following:

  • The reliability of the simulator can be a bottleneck on the training performance. A realistic simulator is required to decrease the gap between the simulator and the real-world environment in order to maintain robust control when the agent is deployed to the real world.
  • The efficiency of the simulator can be another bottleneck on the training performance. The simulation time for each training episode should be short, considering that thousands of episodes are required for the agent to learn the optimal control policy; for a complex environment, the number of required episodes can be even higher. Consequently, the computational cost and time can be a technical challenge.

It is also possible to learn from a real-world environment (using an experimental device instead of a simulator in Fig. 6), followed by evaluation and deployment on a real-world material. This approach is much more expensive than the previous one; learning from real-world materials will require up to 20,000 material samples, which will pose technical challenges in terms of speed, cost, and infrastructure. Nonetheless, sufficient evaluation on real-world materials will be needed before deployment. In that case, the real-world challenges to be encountered are the following:

  • Since the training process can take thousands of episodes depending on the complexity of the problem, training the RL agents in a real-world environment can be costly and requires careful development of laboratory infrastructure. In most cases, it is not feasible to train the agent in the real-world environment.
  • The sensors and controllers must be accurate and agile enough to provide robust monitoring and control.

Conclusions
This article presents the application of a DDPG-based reinforcement learning control framework to a real-world control problem. The crack propagation process is simulated in Abaqus, which is integrated with the reinforcement learning environment. The proposed control framework demonstrates a powerful control capability and has the potential to be applied in real-world domains including geomechanics, civil engineering, hydraulic fracturing for hydrocarbon and geothermal energy production, ceramics, structural health monitoring, and materials science, to name a few. The real-world deployment of the proposed control framework, along with potential challenges and solutions, is also discussed.


Acknowledgements
This research work is supported by the US Department of Energy, Office of Science, Office of Basic Energy Sciences, Chemical Sciences, Geosciences, and Biosciences Division under the Award Number DE-SC0020675.


References
Pyrak-Nolte, L.J., DePaolo, D.J., and Pietraß, T. 2015. Controlling Subsurface Fractures and Fluid Flow: A Basic Research Agenda. USDOE Office of Science, United States.

Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. 2015. Continuous Control With Deep Reinforcement Learning. arXiv:1509.02971.

Jin, Y. and Misra, S. 2023. Controlling Fracture Propagation Using Deep Reinforcement Learning. Engineering Applications of Artificial Intelligence 122: 106075.

Jin, Y. and Misra, S. 2022. Controlling Mixed-Mode Fatigue Crack Growth Using Deep Reinforcement Learning. Applied Soft Computing 127: 109382.