AI Robotics Case - Controlling Robots with LLMs (Large Language Models)
Introduction: Controlling Robots with LLMs (Large Language Models)
In recent years, the integration of Large Language Models (LLMs) with robotic systems has emerged as a significant advancement in the field of robotics. LLMs, such as those based on Transformer architectures, provide a powerful tool for interpreting natural language commands and converting them into actionable instructions for robots. This capability bridges the gap between human operators and robots, enabling more intuitive and flexible control mechanisms.
LLMs work by understanding and processing human language in a way that allows them to generate meaningful responses or instructions. When applied to robotics, these models can interpret commands given in natural language, parse the intent, and map this intent to specific actions that the robot can execute.
In summary, controlling robots with LLMs is a promising approach that leverages the capabilities of advanced language models to simplify and enhance human-robot interaction. By enabling robots to understand and act on natural language commands, this technology has the potential to make robotic systems more accessible and easier to operate in a variety of applications.
AI (LLM) generated code generally requires high-level sensory inputs and a high-level motion control structure, and ACROME's Smart Motion Devices (SMD) are well suited for this purpose. The SMD ecosystem provides a high-level, low-latency, and optimized command structure for controlling the electromechanical parts of a robotic system. This enables the AI-generated code to operate seamlessly and reduces the back-end work a programmer must do to translate human prompts into robot commands such as "move forward", "turn left", "turn 60 degrees", "move at 5 km/h", "move until you see an obstacle", etc.
In the next sections, we briefly explain the components of the solution and provide real-world example applications of human-prompted robotic motion execution.
What is an LLM (Large Language Model)?
A Large Language Model (LLM) is a type of artificial intelligence (AI) model designed to understand, generate, and manipulate human language. These models are built using deep learning techniques, particularly based on architectures such as Transformers, which allow them to process and generate text with a high degree of fluency and coherence. Examples of real-world LLMs include ChatGPT (from OpenAI), Gemini (Google), Llama (Meta), and Bing Chat (Microsoft). GitHub's Copilot is another example, but for coding instead of natural human language.
Applications of LLMs (Large Language Models)
Large Language Models (LLMs) are versatile tools used across various fields due to their ability to understand and generate human language. In customer support, they power chatbots and virtual assistants, enabling more natural and effective interactions. In content creation, LLMs generate and edit text, making them valuable for writing, summarizing, and creative tasks. They also enhance machine translation by providing more accurate and context-aware translations. In fields like e-commerce and streaming, LLMs drive personalized recommendations by analyzing user behavior. Additionally, LLMs are crucial in coding, where they assist in generating code snippets and debugging, as well as in legal and medical fields for document analysis and summarization. Their applications extend to education, where they provide personalized tutoring and content generation, and robotics, where they enable natural language control of machines. Overall, LLMs are integral in improving efficiency and accuracy in a wide range of tasks.
What We Used the LLM For and How?
In our project, the Large Language Model (LLM) was employed primarily to enhance the interaction between the user and a robot by interpreting natural language commands and converting them into precise robotic actions. Here's how we utilized the LLM:
1. Natural Language Command Processing
• The LLM was responsible for processing the user's input, which was given in natural language. For instance, commands like "Move the robot forward by 10 centimeters" or "Turn the robot 90 degrees to the left" were interpreted by the LLM to understand the intended action.
• This natural language processing step involved breaking down the command into its components (e.g., action, direction, distance/angle) and understanding the context to determine the appropriate response.
2. Function Calling
• Once the LLM understood the user's intent, it mapped this intent to specific functions that were predefined in our robot control system. For example, the command to move the robot forward was mapped to the linear_movement function, while the command to turn the robot was mapped to the turn function.
• The LLM ensured that these functions were called in the correct sequence and with the appropriate parameters, allowing the robot to execute the desired movements accurately.
3. Automated Sequence of Actions
• For more complex tasks that required a sequence of actions, the LLM was capable of generating and organizing these steps. For example, if the user requested the robot to move in a circular path and then stop at a specific point, the LLM would generate the necessary sequence of commands (e.g., initializing the robot, performing radial movement, and then stopping) and execute them in the correct order.
• This capability was particularly useful for tasks that involved multiple steps or required the robot to perform a series of movements in a coordinated manner (an illustrative call sequence is sketched below).
In summary, we used the LLM to create a more intuitive and flexible interface for controlling a robot, allowing users to issue commands in natural language, which the LLM then translated into specific robotic actions. This approach not only simplified the interaction process but also enabled more complex and coordinated movements that would be challenging to program directly.
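As an illustration of such a coordinated sequence, a prompt like "Draw a square each side is 50 cm by moving the robot" (one of the example prompts shown later in this article) could resolve into a call sequence similar to the sketch below. This is only an illustration of the kind of plan the LLM generates at run time using the pseudo-functions described in the following sections; it is not a fixed, hand-written program.

init_robot()              # establish the known (0, 0) starting pose first
for _ in range(4):        # a square consists of four equal sides
    linear_movement(50)   # drive straight for one 50 cm side
    turn(90)              # positive degrees rotate the robot to the left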
Example Application Details:
In our project, we implemented a practical application in which the user's prompts are transferred to a mobile robot and the robot translates them into motion. The application runs on two computational sides.
The client side of the AI application runs on the user's PC. It handles the user inputs and communicates with a web-based LLM (Gemini is used here, but other LLM engines can also be implemented). It sends the user inputs to Gemini via the Gemini API and receives the responses from the Gemini LLM. The responses from Gemini are parsed and then passed to the Pseudo-Function block. The Pseudo-Function block converts the sequential responses of the AI into the respective robot functions and sends these robot functions to the mobile robot through the Flask API.
The robot side of the AI application runs on a Raspberry Pi computer located on ACROME's SMD-based mobile robot. The Raspberry Pi controls the robot using the Python API of the SMD components. The robot consists of the SMD Red - Brushed DC Motor Driver Module, the SMD Ultrasonic Distance Sensor Module, and various SMD Building Set parts. More information is provided in the next section.
The flowchart below depicts the specific components and methods used.
Robot Hardware Details
• Smart Motion Devices (SMD): The robot is equipped with ACROME's Smart Motion Devices, which provide high torque and precise positioning capabilities. These devices are crucial for the accurate control of the robot’s movements.
• DC Motors: The robot uses two DC motors driven by the SMD modules. These motors are responsible for the robot's mobility, allowing it to perform linear and radial movements as well as rotations.
• Raspberry Pi: The Raspberry Pi serves as the central control unit, running the Flask API that manages the robot's commands. It interfaces with the SMD modules through the SMD USB Gateway module and handles communication with the client-side PC over a wireless (or, for small tasks, sometimes wired) network. SMD products have a native Python API; more information is available on the GitBook pages of the SMD products.
• Power Supply: A battery pack powers both the Raspberry Pi and the motors, ensuring consistent operation during the robot's movement and control processes.
Software Details
In our project, the software layer plays a crucial role in controlling the robot, managing communication, and ensuring that commands are executed accurately and efficiently. As explained in the previous section, the software runs on two sides, i.e., the Client-Side and the Robot-Side.
Below is a detailed breakdown of each software side and their functions.
Client-Side
On the client side, we created pseudo-functions to facilitate the interaction between the language model (LLM) and the robot's API. These functions are termed "pseudo" because they do not directly execute the robot's actions themselves but instead serve as abstract representations of the real functions that send POST requests to the robot's API. The primary reason for using pseudo-functions is to provide a structured and simplified way for the LLM to generate and understand control commands.
Pseudo-Function Design:
The pseudo-functions (init_robot, linear_movement, turn, and radial_movement) serve as the primary interface between the LLM and the robot's API. Each function is designed to encapsulate a specific robot movement or control command, with a detailed description that explains its purpose, functionality, parameters, and return values to the LLM.
• Initialization (init_robot): This function must be called first to initialize the robot's position and orientation. This ensures that all subsequent movements are based on a known starting point, which is crucial for accurate navigation and control.
• Movement Functions (linear_movement, turn, radial_movement): These functions control the robot's movements in various ways. linear_movement allows for straight-line travel, turn enables rotation around the robot's axis, and radial_movement handles circular paths. Each function returns the robot's updated position and orientation, providing feedback for the LLM to make informed decisions about subsequent actions.
Example Pseudo-Function:
import requests

# robotip holds the URL of the robot's Flask API endpoint (defined elsewhere in the client code).
def linear_movement(cm: int = 0):
    """
    Controls the robot's movement in a straight line for a specified distance.

    Functionality:
    --------------
    This function calculates the necessary encoder positions based on the given
    distance (in centimeters) and commands the motors to move the robot to those
    positions. The motors are controlled to achieve the desired linear displacement
    by adjusting their position using the specified CPR (Counts Per Revolution) values.

    Parameters:
    -----------
    cm : int, optional
        The distance in centimeters the robot should move.
        - A positive value makes the robot move forward.
        - A negative value makes the robot move backward.

    Return Value:
    -------------
    This function returns the robot's updated position and orientation as a tuple
    (x, y, angle), where x and y are the new coordinates in centimeters, and angle
    is the orientation in degrees.

    Example:
    --------
    To move the robot forward by 10 centimeters:
    >>> linear_movement(10)
    (10, 0, 0)
    """
    out = requests.post(robotip, json={"id": "1", "cm": cm})
    data = out.json()
    x = data["x"]
    y = data["y"]
    angle = data["angle"]
    return x, y, angle
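The docstring above refers to converting the requested distance into encoder positions using the motors' CPR value. The exact conversion on the robot side depends on the wheel and encoder actually installed; the sketch below only shows the general idea, with hypothetical wheel-diameter and CPR values.

import math

# Hypothetical values for illustration only; the real figures depend on the
# SMD motor, encoder, and wheel combination used on the robot.
WHEEL_DIAMETER_CM = 6.5
CPR = 6533  # encoder counts per revolution of the wheel shaft

def cm_to_counts(cm: float) -> int:
    """Convert a linear distance in centimeters to encoder counts for one wheel."""
    wheel_circumference_cm = math.pi * WHEEL_DIAMETER_CM
    revolutions = cm / wheel_circumference_cm
    return round(revolutions * CPR)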
Integration with LLM:
We used Google’s Gemini API for this project. The Gemini API provides access to Google's advanced generative AI models, enabling developers to build applications that can process and generate various forms of data, including text, images, code, and audio. The pseudo-functions are registered with the generative model as tools that the LLM can call when generating control algorithms. The model is provided with the pseudo-functions and instructed to use them in a way that accomplishes the user's goals.
Guidance for LLM:
The system instructions emphasize the importance of using the init_robot function before any other control functions. This instruction ensures that the LLM correctly initializes the robot's position before attempting any movement, thereby avoiding errors related to undefined or inaccurate starting coordinates.
Additionally, the LLM is encouraged to use the functions creatively to achieve the desired results. This open-ended instruction allows the LLM to explore various combinations of movements and rotations to accomplish complex tasks, such as navigating through an environment or performing specific maneuvers.
import google.generativeai as genai

# genai.configure(api_key=...) must be called beforehand with a valid Gemini API key.
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    tools=[init_robot, turn, linear_movement, radial_movement],
    system_instruction="""By using the functions given, create
algorithms that will do what the user wants.
Use the functions in a creative way to achieve the desired result.
Do not forget to use the functions in the correct order to achieve
the desired result. Do not forget to use the init_robot function before
using other functions.""")
The User Interface (UI)
In this project, Streamlit was used to create a simple, interactive web-based interface running on the user's computer that allows the user to control the robot via natural language commands.
The user interface has 3 sections:
Prompt Text Input Field: Streamlit provided an easy way to create a user-friendly interface where users could enter their commands in a text input field. This input is then processed by the LLM to generate the corresponding robot control instructions.
Submit Button: A submit button was added to trigger the processing of the command. When the user clicks this button, the command is sent to the LLM for interpretation and execution.
Result Output Field: After the robot motion executes, the UI gives a textual output about the result of the execution.
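A minimal sketch of such a Streamlit front end is given below. The helper run_llm_command and its wiring to the Gemini chat session from the earlier sketch are assumptions made for illustration; the actual project code may organize this differently.

import streamlit as st

def run_llm_command(user_prompt: str) -> str:
    # Hypothetical helper: forwards the prompt to the Gemini chat session
    # (with the pseudo-functions registered as tools) and returns the
    # model's textual result.
    response = chat.send_message(user_prompt)
    return response.text

st.title("LLM Robot Control")

# Prompt text input field
prompt = st.text_input("Enter a command for the robot")

# Submit button triggers interpretation and execution of the command
if st.button("Submit") and prompt:
    result = run_llm_command(prompt)
    # Result output field
    st.write(result)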
Robot Side of the Software
For the robot side, we used a Flask-based RESTful API and the native Python motion & sensor commands of ACROME's Smart Motion Devices. The details of each component are given below.
Flask-Based RESTful API
Flask API: The robot is controlled through a Flask-based RESTful API hosted on the Raspberry Pi. This API acts as the communication layer between the user and the robot's hardware, receiving HTTP POST requests and executing corresponding control functions.
• init_robot(cm=0): Initializes the robot’s position and orientation, setting the starting coordinates to (0, 0) and the angle to 0 degrees. This function must be called before any other movement functions to establish the robot’s initial state.
• linear_movement(cm): Moves the robot in a straight line by the specified distance (in centimeters). A positive value moves the robot forward, while a negative value moves it backward.
• turn(degree): Rotates the robot by the specified number of degrees. Positive values result in a left turn, and negative values result in a right turn.
• radial_movement(radius, degree): Moves the robot along a circular path defined by a radius and an angle. Positive radius values cause the robot to move in a rightward arc, while negative values create a leftward arc.
API Endpoint Structure
The Flask API defines a primary endpoint /execute that handles different robot control commands based on the JSON data sent in the HTTP request. The API parses the command and calls the corresponding control function.
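A minimal sketch of such an endpoint is given below. It assumes the robot-side control functions listed above are defined elsewhere in the script; the mapping of "id" values to functions (other than "1" for linear_movement, which matches the client-side pseudo-function shown earlier) and the field names are assumptions made for illustration.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/execute", methods=["POST"])
def execute():
    # Parse the JSON command sent by the client-side pseudo-functions and
    # dispatch it to the corresponding robot control function.
    data = request.get_json()
    command_id = data.get("id")
    if command_id == "0":
        x, y, angle = init_robot(data.get("cm", 0))
    elif command_id == "1":
        x, y, angle = linear_movement(data["cm"])
    elif command_id == "2":
        x, y, angle = turn(data["degree"])
    elif command_id == "3":
        x, y, angle = radial_movement(data["radius"], data["degree"])
    else:
        return jsonify({"error": "unknown command id"}), 400
    # Return the updated pose so the client side can relay it back to the LLM.
    return jsonify({"x": x, "y": y, "angle": angle})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)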
The SMD Python Library
The SMD Python library facilitates user-friendly access to ACROME's Smart Motion Device products, catering to programmers of all skill levels and supporting a wide range of projects. It allows for both simple tasks, such as motor speed adjustments, and complex operations like precise positioning and PID auto-tuning, all utilizing Python's versatility. Additionally, the library streamlines the integration of SMD sensor modules, simplifying the use of third-party sensors and enhancing project functionality and efficiency. Users only require a simple computational unit such as a Raspberry Pi to run the Python scripts for their projects.
Example Robot Motion Prompts and Robot's Execution
In this section, 3 example robot motion prompts and the videos related to these examples are given. Please note that in each video the previously mentioned User Interface is used to enter the human prompt, and the software executes accordingly. Each video is taken with a single camera in these 3 consecutive steps, in real time:
The PC screen is recorded with the camera after the prompt is given
Robot's motion is recorded next with the same camera
The PC screen is re-recorded as soon as the robot's motion execution finishes
Here are the example prompts and their respective videos:
Prompt #1: Draw a circle with radius of 40 cm.
Prompt #2: Draw a triangle with each side is 50 cm by moving the robot
Prompt #3: Draw a square each side is 50 cm by moving the robot
Application Areas of the Project's Methods and Future Developments
The methods employed in this project, particularly the integration of Large Language Models (LLMs) with robotic control systems, are highly adaptable and can be applied across various domains. The core strength of these methods lies in their flexibility and the natural language interface they provide, which can significantly simplify the interaction between humans and robots.
The methods used in this project are applicable across various fields, including industrial automation, healthcare, service, education, and research. LLMs enable intuitive and flexible control of robots through natural language, allowing for tasks such as remote operation in hazardous environments, patient assistance, customer interaction, interactive learning, and rapid prototyping. These applications demonstrate the versatility of LLMs in enhancing both human-robot interaction and the overall efficiency of robotic systems in diverse settings.
Future developments in the use of LLMs for robotic control could focus on enhancing autonomy through integration with advanced AI algorithms, improving contextual understanding for handling complex tasks, and enabling multimodal interaction by combining language with other inputs like gestures or visual data. Scalability can be achieved through cloud integration, allowing resource-constrained robots to utilize powerful LLMs. Security and ethical considerations will be crucial as robots become more integrated into daily life, ensuring safe and responsible deployment. Additionally, enhancing human-robot collaboration and adaptive learning capabilities will improve teamwork and efficiency in various applications.