A quick set of links to help convert webpages or even Git repositories into a single Markdown file for easier absorption by LLMs. Now how does that help an LLM better understand data, you might ask? Well, by putting the content into a structured format, the LLM gets a clearer signal of what each piece of content is supposed to be and a better understanding of how one piece of information relates to another. Markdown makes it easy for models to quickly see the main topics or sections based on heading levels (e.g. H1 for the main topic, H2 for primary topics, H3 for sub-topics, etc.), and it also indicates how the information is formatted when displayed. The easiest way to think about it is that structured data provides a “roadmap” of sorts to the content being ingested. To learn from the data and associate it with data from other sources, the system tries to parse or “chunk” it into various associated groupings. This is all an extremely basic description of how knowledge is managed in these systems, but the purpose of this note isn’t to go into the “Why?” so much as it is to share a few tools I’ve recently found useful to accomplish this.
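To make the “roadmap” idea a little more concrete, here is a minimal Python sketch that pulls the heading outline out of a Markdown document. It is only an illustration of how heading levels expose structure, not how any particular LLM or chunking library actually works; the `outline` function name is my own, and it only understands `#`-style headings.

```python
import re

FENCE = "`" * 3  # fenced code block marker
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")

def outline(markdown_text):
    """Return (level, title) pairs for '#'-style headings, skipping fenced code."""
    result = []
    in_fence = False
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            # Toggle on every fence line so '#' comments in code are ignored.
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(stripped)
        if match:
            result.append((len(match.group(1)), match.group(2)))
    return result
```

Feeding it a converted page gives you a quick list of the levels and titles, which is roughly the skeleton a chunker could split on.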
Combining Git Repositories Into A Single File - GitIngest
Some of the recent AI programming assistants have had mixed results when faced with learning from entire code repositories, and one idea that’s been gaining traction is to combine the files into a single source so that the model “hopefully” won’t get confused as to how a piece of software or app works. One of the easiest tools I’ve come across is an open-source project by Romain Courtois called GitIngest. You can use the online app over at GitIngest.com, self-host it via the provided Docker containers, run it locally or even use it as a Python library in your own projects. It comes with its own simple front-end, the ability to restrict allowed hostnames, ignore file types, include only files under a certain size, basic token estimation, and it even lets you preview the output before downloading the final compiled file. Pretty useful if you ask me.
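If you’re curious what “combining a repository into a single file” boils down to, here is a rough standard-library sketch of the idea. To be clear, this is not GitIngest’s code or API; `combine_repo`, `estimate_tokens`, the extension filter and the 4-characters-per-token rule of thumb are all my own assumptions for illustration.

```python
from pathlib import Path

def combine_repo(root, extensions=(".py", ".md"), max_bytes=50_000):
    """Concatenate matching files under root into one Markdown-style string."""
    root = Path(root)
    parts = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Skip directories and anything inside the .git folder.
        if not path.is_file() or ".git" in rel.parts:
            continue
        # Skip unwanted file types and files over the size limit.
        if path.suffix not in extensions or path.stat().st_size > max_bytes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        parts.append(f"## File: {rel.as_posix()}\n\n{text.rstrip()}\n")
    return "\n".join(parts)

def estimate_tokens(text):
    """Very rough estimate, assuming about 4 characters per token."""
    return len(text) // 4
```

Something like `combined = combine_repo("path/to/repo")` followed by `estimate_tokens(combined)` gives you a single blob plus a ballpark of whether it will fit in a context window.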
Converting a Website Into Markdown Content
So there are a bunch of tools out there that will let you convert single pages and even entire websites into a standardized Markdown format, as well as a few open-source projects that can help you write your own tools to do the same. Let’s break down the basic process behind it:
- First you’re going to need to scrape the webpage.
- Next you’re going to need to parse out all the information you don’t want or need like ads, tracking scripts, headers/footers and if possible most of the navigational UI.
- Once you’ve done that then you need to convert that content into Markdown.
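The clean-up and conversion steps above can be sketched with nothing but Python’s standard library. This is a deliberately tiny illustration, not what tools like Readability or Turndown actually do: it takes HTML you have already fetched, drops a hard-coded set of unwanted tags, and only converts headings, paragraphs and list items. The `MarkdownExtractor` class, the `html_to_markdown` function and the tag lists are my own assumptions.

```python
from html.parser import HTMLParser

# Tags whose entire contents get thrown away (scripts, navigation, etc.).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Block tags we keep, mapped to the Markdown prefix they become.
PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}

class MarkdownExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []     # finished Markdown blocks
        self.prefix = None   # prefix of the block currently being read
        self.buffer = []     # text fragments of the current block
        self.skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in PREFIXES:
            self.prefix, self.buffer = PREFIXES[tag], []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in PREFIXES and self.prefix is not None:
            # Collapse runs of whitespace and store the finished block.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.prefix = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)

def html_to_markdown(html):
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

Real pages are far messier than this (nested lists, links, tables, JavaScript-rendered content), which is exactly why the dedicated tools below exist.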
While the basics behind the format are pretty standardized, there are a few popular variations you will come across like MultiMarkdown, Markdown Extra, Pandoc and GFM (GitHub Flavored Markdown), in addition to some sites like Reddit and Stack Overflow that use site-specific versions of it for their own functionality.
Tools I’ve Used To Convert Websites to Markdown For LLM Use
Here are a few free tools, most of which you can host yourself, that will help. I’ve used all of them at some point in the last year and, as I’ll get into, some work better than others for certain kinds of content/websites.
MarkdownDown - This is a great tool, and the one I’ve been using the most lately, to scrape and convert websites into Markdown. Asad Memon hosts it for free on Vercel and has the full source code on his GitHub. It uses Puppeteer to scrape the page, Turndown to convert it to the Markdown format and the Mozilla Readability library to clean the content of any unnecessary, undesired crap. It has the option to add a Browserless API key so you don’t have to scrape it yourself, and it also includes the option to spin up a Cloudflare Worker that uses the Browser Rendering API (which requires the $5/month paid Cloudflare Workers plan) to do the same. It has a nice front-end that lets you plug in your own OpenAI key if you want it to manipulate the content before outputting it to Markdown, as well as the option to download the images locally (and re-link them). If you’re trying to scrape pages like Reddit, there is currently a working pull request by laggingreflex which enables the puppeteer-extra stealth plugin to circumvent blocks.
HeckYesMarkdown.com - Created by Brett Terpstra, heckyesmarkdown will scrape a provided URL and output the contents in a variety of customizable formats ranging from multiple types of Markdown to HTML, LaTeX, OpenDocument, Plaintext, RTF and more than a dozen others. It provides a convenient bookmarklet you can put in your browser to use it on the go and displays the results in an easy-to-copy format with the ability to preview what the rendered Markdown will look like in a browser. It also lets you import the result directly into apps like Obsidian (which I love). He was also kind enough to provide an API in case you want to incorporate the service into your own scripts.
URLtoMarkdown.com - This is a great open-source tool that uses Turndown, Readability, jsdom and Heroku to convert a URL into Markdown, and it allows you to ignore links and/or include the page title in the result. You can use the tool on their homepage or go to their GitHub to host your own. One of the best features of this site is that it also provides a web service that will convert a URL to Markdown simply by adding it to the end of their Heroku app URL (an app you can also host yourself), e.g. https://urltomarkdown.herokuapp.com/?url=https://example.com
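If you want to call that web service from a script, a small helper to build the request URL might look like the sketch below. The `url` parameter comes from the example above; the `title` and `links` parameter names are my assumption about how the title/link options are exposed, so check the project’s README before relying on them. The function name is hypothetical.

```python
from urllib.parse import urlencode

SERVICE = "https://urltomarkdown.herokuapp.com/"

def urltomarkdown_link(page_url, include_title=False, include_links=True):
    """Build a request URL for the urltomarkdown web service."""
    params = {"url": page_url}
    if include_title:
        params["title"] = "true"   # assumed parameter name
    if not include_links:
        params["links"] = "false"  # assumed parameter name
    # urlencode percent-encodes the target URL so it survives as one parameter.
    return SERVICE + "?" + urlencode(params)
```

You can then fetch the resulting address with whatever HTTP client you like, and if you host your own copy, swap `SERVICE` for your own app’s address.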
Jina.AI Reader API - Jina.AI provides a free reader API and web service that can convert URLs into formats like Markdown, TXT, HTML and even a screenshot. While you don’t need an API key to use it, requests without one are rate-limited. Luckily they provide a free API key with 1 million credits on it (yes, 1 million) just by visiting their website (no registration/login/credit card required). I hope they don’t read this, but it’s also just as easy to get a new one once you’ve used that up just by clearing your cookies and revisiting the same page. It lets you grab the entire website, select a specific CSS target to include or exclude, remove all the images, gather all the links and/or images at the end, forward custom cookies, use a proxy service and a gaggle of other useful options for hard-to-scrape sources like Reddit or dynamic JavaScript-based sites. With that same API key you can also use a few of their other API-based products like their AI re-ranker, zero-shot classification tool and more. To use their free web service just add your desired URL at the end of r.jina.ai like this: https://r.jina.ai/https://example.com
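For scripting, the same prefix trick can be wrapped in a few lines of Python. This is a minimal sketch using only the standard library: it builds the request but doesn’t send it. Passing the key as an `Authorization: Bearer` header is my assumption about how the API key is supplied, and `jina_reader_request` is a made-up helper name.

```python
from urllib.request import Request

READER_PREFIX = "https://r.jina.ai/"

def jina_reader_request(page_url, api_key=None):
    """Build (but do not send) a request to the Jina reader for page_url."""
    headers = {}
    if api_key:
        # Assumption: the key is passed as a standard bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return Request(READER_PREFIX + page_url, headers=headers)
```

To actually fetch the page you would hand the result to `urllib.request.urlopen(...)` and decode the response body.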
Firecrawl.dev - Firecrawl is a pretty popular tool that offers both an open-source version that you can self-host and a fully-featured website API tool designed to scrape, crawl and extract entire websites or even libraries and convert them into LLM-ready structured data. You can save the content as Markdown, HTML or screenshots, or use their custom Extract LLM service to export it in LLM-specific formats (still in beta). It’s very fast, works on most if not all websites I’ve thrown at it, and if you don’t want to self-host it they provide you with 500 credits for free to test it out. Firecrawl is used in a LOT of the big frameworks that you’ll find out there, and some tend to lean on it pretty hard as their go-to LLM data rendering/scraping/extraction service. If you’d rather just pay a small fee to have everything taken care of for you in an easy, ready-made platform, this would probably be the one to look into. They have an endless list of features and integrations, incorporate all sorts of different media parsing into their runs and allow you to set up crawl/scrape jobs without needing much technical knowledge. They outline the big differences between their open-source offering and their paid cloud service on their docs page.
Webscrapbook - Webscrapbook is an open-source browser extension available for both Chromium-based browsers (Chrome, Edge, Opera, Vivaldi, etc.) and Firefox-based browsers (Firefox, Tor, Waterfox) that lets you save any website in a variety of formats as well as manipulate and annotate the content before saving it. You can have it save the content either locally on your own device or remotely so it can be easily shared with others. It’s mobile friendly, backwards compatible with legacy versions of ScrapBook X, and offers the ability to take your saved pages and generate a static site index that can be used as its own static website (which can come in handy when ripping data for purposes other than LLM use).
If you search on GitHub or even Google for “website to structured data converter” you can see there is an endless supply of both paid and open-source tools available that all have their own pros/cons. Using a combination of the ones above, I haven’t found any source that I wasn’t able to easily convert to Markdown. Hope that helps.
Cheers,
Brent