Facebook Researchers Develop TransCoder AI That Converts Code from One Programming Language into Another

  • Facebook researchers say they’ve developed what they call a neural transcompiler, a system that converts code from one high-level programming language, such as C++, Java, or Python, into another.

  • Migrating an existing codebase to a modern or more efficient language like Java or C++ requires expertise in both the source and target languages, and it’s often costly.

  • The Facebook researchers trained TransCoder on a public GitHub corpus containing over 2.8 million open source repositories, targeting translation at the function level.


Facebook researchers say they’ve developed what they call a neural transcompiler, a system that converts code from one high-level programming language, such as C++, Java, or Python, into another. It’s unsupervised, meaning it looks for previously undetected patterns in data sets without labels and with a minimal amount of human supervision, and it reportedly outperforms rule-based baselines by a “significant” margin.
 

Migrating an existing codebase to a modern or more efficient language like Java or C++ requires expertise in both the source and target languages, and it’s often costly. For example, the Commonwealth Bank of Australia spent around $750 million over the course of five years to convert its platform from COBOL to Java. Transcompilers could help in theory, since they eliminate the need to rewrite code from scratch, but they’re difficult to build in practice because languages differ in syntax and rely on distinctive platform APIs, standard-library functions, and variable types.


Facebook’s system, TransCoder, which can translate between C++, Java, and Python, tackles the challenge with an unsupervised learning approach. TransCoder is first initialized with cross-lingual language model pretraining, which maps pieces of code expressing the same instructions to identical representations regardless of programming language. (Tokens in the input source code sequences are randomly masked out, and TransCoder is tasked with predicting the masked-out portions based on context.) A process called denoising auto-encoding trains the system to generate valid sequences even when fed noisy input data, and back-translation allows TransCoder to generate parallel data that can be used for training.
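To make the two corruption steps concrete, here is a minimal Python sketch of how training inputs for masked prediction and denoising might be prepared. It is an illustration only, not Facebook's implementation: the `<mask>` placeholder, the function names, the probabilities, and the specific noise (token dropping plus one adjacent swap) are all assumptions chosen for brevity.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def mask_tokens(tokens, mask_prob, rng):
    """Masked-prediction input: hide some tokens and record what was hidden."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # a model would be trained to predict this token
        else:
            masked.append(tok)
    return masked, targets


def add_noise(tokens, drop_prob, rng):
    """Denoising input: randomly drop tokens, then swap one adjacent pair."""
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if len(kept) > 1:
        i = rng.randrange(len(kept) - 1)
        kept[i], kept[i + 1] = kept[i + 1], kept[i]
    return kept


rng = random.Random(0)  # seeded so repeated runs corrupt the same positions
tokens = "for ( int i = 0 ; i < n ; i ++ )".split()
masked, targets = mask_tokens(tokens, 0.15, rng)
noisy = add_noise(tokens, 0.1, rng)
print(masked)   # same length as tokens, some entries replaced by <mask>
print(targets)  # position -> original token for every masked position
print(noisy)    # a shorter, slightly shuffled copy of tokens
```

In both cases the training target is the clean sequence: the model sees the corrupted version and is asked to recover the original.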


The cross-lingual nature of TransCoder arises from the number of common tokens — anchor points — existing across programming languages, which come from common keywords like “for,” “while,” “if,” and “try” and also digits, mathematical operators, and English strings that appear in the source code. Back-translation serves to improve the system’s translation quality by coupling a source-to-target model with a “backward” target-to-source model trained in parallel. The target-to-source model is used to translate target sequences into the source language, producing noisy source sequences, while the source-to-target model helps to reconstruct the target sequences from the noisy sources until the two models converge.
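The idea of anchor points can be illustrated with a short Python sketch that tokenizes a Python function and a hand-written Java equivalent and lists the tokens they share. The regular-expression tokenizer and both snippets are assumptions made for this example; the real system uses language-specific tokenizers.

```python
import re

# Rough tokenizer (an assumption for illustration): identifiers and keywords,
# integer literals, and single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]")


def tokenize(code):
    return TOKEN_RE.findall(code)


python_src = """
def total(xs):
    s = 0
    for x in xs:
        if x > 0:
            s += x
    return s
"""

java_src = """
static int total(int[] xs) {
    int s = 0;
    for (int x : xs) {
        if (x > 0) { s += x; }
    }
    return s;
}
"""

# Tokens that appear in both versions can act as cross-language anchors.
shared = set(tokenize(python_src)) & set(tokenize(java_src))
print(sorted(shared))
```

Keywords such as `for`, `if`, and `return`, the digit `0`, and operators like `>` appear in both versions, while language-specific tokens such as `def` and `static` do not.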



The Facebook researchers trained TransCoder on a public GitHub corpus containing over 2.8 million open source repositories, targeting translation at the function level. (In programming, functions are blocks of reusable code that are used to perform a single, related action.) After pretraining TransCoder on all source code available, the denoising auto-encoding and back-translation components were trained on functions only, alternating between the components with batches of around 6,000 tokens.
 

To evaluate TransCoder’s performance, the researchers extracted 852 parallel functions in C++, Java, and Python from GeeksforGeeks, an online platform that gathers coding problems and presents solutions in several programming languages. Using these, they developed a new metric — computational accuracy — that tests whether hypothesis functions generate the same outputs as a reference when given the same inputs.
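A metric of this kind can be sketched in a few lines of Python. The version below is a simplified assumption, not the researchers' evaluation harness: the "translations" are stand-in Python functions rather than generated C++ or Java, the function names and test inputs are hypothetical, and a translation that raises an error is simply counted as incorrect.

```python
def computational_accuracy(pairs, test_inputs):
    """Fraction of (reference, hypothesis) pairs that agree on every test input."""
    correct = 0
    for (reference, hypothesis), inputs in zip(pairs, test_inputs):
        try:
            ok = all(hypothesis(*args) == reference(*args) for args in inputs)
        except Exception:
            ok = False  # a crashing translation counts as incorrect
        correct += ok
    return correct / len(pairs)


# Hypothetical reference functions and stand-in "translations".
def ref_abs(x): return abs(x)
def hyp_abs(x): return x if x >= 0 else -x      # faithful, written differently

def ref_max(a, b): return max(a, b)
def hyp_max(a, b): return a                      # wrong whenever b > a

def ref_head(xs): return xs[0] if xs else None
def hyp_head(xs): return xs[0]                   # raises IndexError on []

def ref_sum(xs): return sum(xs)
def hyp_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total                                 # faithful

pairs = [(ref_abs, hyp_abs), (ref_max, hyp_max),
         (ref_head, hyp_head), (ref_sum, hyp_sum)]
test_inputs = [
    [(-3,), (0,), (5,)],
    [(1, 2), (4, 3)],
    [([1, 2],), ([],)],
    [([1, 2, 3],), ([],)],
]
score = computational_accuracy(pairs, test_inputs)
print(score)  # 0.5: two of the four translations match on all inputs
```

Note that `hyp_abs` and `hyp_sum` count as correct even though their text differs from the references, which is the point of measuring behavior rather than exact matches.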


The researchers note that while the best-performing version of TransCoder didn’t generate many functions strictly identical to the references, its translations had high computational accuracy. They attribute this to the incorporation of beam search, a method that maintains a set of partially decoded sequences, extends each with candidate next tokens, and scores the results so that the best sequences rise to the top:
 

• When translating from C++ to Java, 74.8% of TransCoder’s generations returned the expected outputs.

• When translating from C++ to Python, 67.2% of TransCoder’s generations returned the expected outputs.

• When translating from Java to C++, 91.6% of TransCoder’s generations returned the expected outputs.

• When translating from Python to Java, 56.1% of TransCoder’s generations returned the expected outputs.

• When translating from Python to C++, 57.8% of TransCoder’s generations returned the expected outputs.

• When translating from Java to Python, 68.7% of TransCoder’s generations returned the expected outputs.
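For readers unfamiliar with beam search, the following Python sketch shows the general technique on a toy next-token probability table. It is a generic illustration rather than TransCoder's decoder: the table, the token names, and the `<s>` and `</s>` markers are assumptions made up for this example.

```python
import math

# Toy next-token distribution keyed by the last token (hypothetical values).
TABLE = {
    "<s>": {"return": 0.6, "print": 0.4},
    "return": {"a": 0.5, "b": 0.5},
    "print": {"a": 0.9, "b": 0.1},
    "a": {"</s>": 1.0},
    "b": {"</s>": 1.0},
}


def next_token_probs(seq):
    return TABLE[seq[-1]]


def beam_search(probs_fn, start, end, beam_size, max_len):
    """Keep the beam_size best partial sequences, ranked by log-probability."""
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, p in probs_fn(seq).items():
                candidates.append((score + math.log(p), seq + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            # Completed sequences leave the beam; the rest are extended further.
            (finished if seq[-1] == end else beams).append((score, seq))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[0], reverse=True)


greedy = beam_search(next_token_probs, "<s>", "</s>", beam_size=1, max_len=5)
wide = beam_search(next_token_probs, "<s>", "</s>", beam_size=2, max_len=5)
print(greedy[0][1])  # ['<s>', 'return', 'a', '</s>'], probability 0.30
print(wide[0][1])    # ['<s>', 'print', 'a', '</s>'], probability 0.36
```

With a beam of one, the search commits to the locally best first token and ends with a lower-probability sequence. Keeping two candidates lets it recover the better overall sequence, which is the same reason a wider beam gives a translation model more chances to produce a function that behaves correctly.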


According to the researchers, TransCoder demonstrated an understanding of the syntax specific to each language as well as the languages’ data structures and their methods during experiments, and it correctly aligned libraries across programming languages while adapting to small modifications (like when a variable in the input was renamed). And while it wasn’t perfect (TransCoder failed to account for certain variable types during generation, for example), it outperformed frameworks that rely on rewrite rules manually built using expert knowledge.

“TransCoder can easily be generalized to any programming language, does not require any expert knowledge, and outperforms commercial solutions by a large margin,” the coauthors wrote. “Our results suggest that a lot of mistakes made by the model could easily be fixed by adding simple constraints to the decoder to ensure that the generated functions are syntactically correct, or by using dedicated architectures.”


Facebook isn’t the only organization developing code-generating AI systems. During Microsoft’s Build conference earlier this year, OpenAI demoed a model trained on GitHub repositories that uses English-language comments to generate entire functions. And two years ago, researchers at Rice University created a system — Bayou — that’s able to write its own software programs by associating “intents” behind publicly available code.

 

“[Programs like these are] really just trying to eliminate the minutiae of creating software,” principal scientist and director at Intel Labs Justin Gottschlich told VentureBeat in a recent interview. “[They] could help accelerate productivity … [by taking care of] bugging. [And they could] increase the number of jobs [in tech] because people who don’t have a programming background will be able to take their creative intuition and capture that via machine by these intentionality interfaces.”

