In this tutorial you'll learn how to open and convert Word documents (.doc) to the popular formats RTF, HTML, TXT and XML. You will also find yourself learning the basics of operating with COM objects.
Opening and manipulating Microsoft Office Word documents (.doc) can be done rather easily using the .NET Framework. You can open, edit and create Word documents with only a few lines of code. However, since classes for managing the Word document format are not available in the .NET Framework, the solution is to add a reference to Word’s COM objects to your project. The downside is that, to manage Word documents with the application we’re going to create in this tutorial, the user running it will need to have Microsoft Word installed, preferably the same version that we designed the application for.
In this tutorial the application was designed and tested to work with Microsoft Office version 11, more exactly Microsoft Office Word 2003. On other recent versions, the application is likely to work but it may require a few changes, especially the Open() and SaveAs() functions which probably differ. Therefore if you find the project attached doesn’t work on your system, and you don’t have Microsoft Office 2003 installed, that’s probably the cause.
Just to make things clear: there is a way to open, edit and save Word documents without requiring the Word application to be installed. However, it means writing your own .doc parser from scratch, a task that would call for an entire team of experienced programmers and where a language such as C++ might prove more efficient, unless you find a third-party component that does it for you.
Start by creating a C# Windows application project. Add a total of six buttons and one label. Name the buttons btnOpen, btnClose, btnToHtml, btnToRTF, btnToText and btnToXml, and name the label lblFilePath. Disable the four convert buttons and the close button (btnClose) by setting their Enabled property to false. We will enable them once the user chooses a file to convert. There are two more controls you need to add to the project via the Visual Studio Toolbox: an OpenFileDialog and a SaveFileDialog. Name them openDoc and saveDoc. We will use the first dialog (openDoc) to open the MS Word document that we want to convert, so we want to restrict the user to choosing only Microsoft Word documents (.doc). To do that, change the Filter property of the OpenFileDialog to the following value:
Word Document|*.doc
This ensures that the user will only be able to select a Word document. For more details on this object, please see the Using OpenFile Dialog to open files tutorial.
As for the other dialog – saveDoc, we’re not going to define a filter right now, because the file type to which we’re going to save depends on what button the user clicks (To HTML, To RTF, etc.). We’re going to define the filter when the user clicks the button, because at that time we know the extension.
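One way to keep the per-format filter strings in a single place is a small lookup helper. The sketch below is only an illustration: the `SaveDialogSettings` class and its `FilterFor` method are hypothetical names that are not part of the tutorial project, and the collection initializer assumes C# 3.0 or later.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the original project): maps a file
// extension to the SaveFileDialog filter string used for that format.
public static class SaveDialogSettings
{
    // Filter descriptions taken from the tutorial's four target formats.
    private static readonly Dictionary<string, string> Descriptions =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "html", "HTML Files" },
            { "rtf",  "RTF Files"  },
            { "txt",  "Text Files" },
            { "xml",  "XML Files"  }
        };

    // Builds a filter such as "HTML Files|*.html" for the given extension.
    public static string FilterFor(string extension)
    {
        string description;
        if (!Descriptions.TryGetValue(extension, out description))
        {
            throw new ArgumentException("Unsupported extension: " + extension, "extension");
        }
        return description + "|*." + extension.ToLowerInvariant();
    }
}
```

A click handler could then assign `saveDoc.Filter = SaveDialogSettings.FilterFor("html");` just before calling `ShowDialog()`.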
Now let’s do what we need to do to open a Word document. Right click the project name in Solution Explorer and choose Add Reference. Switch to the COM tab and scroll down until you find Microsoft Word 11.0 Object Library. If you don’t have this item listed, you probably don’t have Microsoft Office installed, so unfortunately the tutorial ends here for you. If you see a different version of the object library, such as Microsoft Word 10.0 Object Library or Microsoft Word 9.0 Object Library, it means you have an older version of Office. Normally you should be able to adjust the code from this tutorial to match your Word version easily.
After you add the Word Object Library to your project, in Solution Explorer you will see some new items were added:
Now that we have Microsoft.Office.Core, VBIDE and Word added as a reference we are ready to start coding. Switch to code view, and the first thing we want to do is create three objects in the Form1 class, right above the constructor:
    private Word.ApplicationClass WordApp;
    private Word.Document WordDoc;
    private object DocNoParam = Type.Missing;
The first object is the Word ApplicationClass, which we can access thanks to the COM reference we added earlier. We’re going to use this to start the Microsoft Word engine, which will do the work of converting the document to the other formats. WordApp will also be the one opening the document; the document will then be stored inside WordDoc, which is the second object we create.
The third object seems kind of odd: it’s an object that holds Type.Missing. The functions we are going to call for opening and saving the document take a handful of parameters, but we’ll only want to specify a few of them. For the other parameters that we don’t have any values to pass to, we’re going to pass this missing object, as in “parameter is missing”.
The reason for this small inconvenience is that the Word COM object model was designed mainly with Visual Basic in mind, which supports optional parameters and allows the caller to skip them. In C# (before version 4.0) we can’t skip these parameters, so we have to pass a placeholder for each of them, similar to specifying null.
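To make the idea of a "missing" placeholder more concrete, here is a minimal sketch that uses plain .NET reflection rather than the Word API. The `MissingDemo` class and its members are hypothetical names made up for this illustration, and the optional parameter assumes C# 4.0 or later. It shows that Type.Missing is the same object as System.Reflection.Missing.Value, and that passing it through reflection makes the callee fall back to its default value, which is conceptually what happens with the skipped Word parameters.

```csharp
using System;
using System.Reflection;

// Hypothetical demo class; it has nothing to do with the Word object model.
public static class MissingDemo
{
    // A method with an optional parameter (requires C# 4.0 or later).
    public static string Greet(string name, string greeting = "Hello")
    {
        return greeting + ", " + name;
    }

    // Type.Missing is the very same object as System.Reflection.Missing.Value.
    public static bool IsSameSentinel()
    {
        return ReferenceEquals(Type.Missing, Missing.Value);
    }

    // Invokes Greet through reflection, passing Type.Missing for the optional
    // parameter so that its declared default value is used.
    public static string GreetWithDefault(string name)
    {
        MethodInfo greet = typeof(MissingDemo).GetMethod("Greet");
        return (string)greet.Invoke(null, new object[] { name, Type.Missing });
    }
}
```

Calling `MissingDemo.GreetWithDefault("Word")` goes through the same kind of placeholder that DocNoParam carries into the Word calls.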
Now that we have these objects ready, we can open the Word document. To do that, double-click btnOpen to create its Click event handler. Use the following code:
    private void btnOpen_Click(object sender, EventArgs e)
    {
        // Create an instance of the Word Application
        WordApp = new Word.ApplicationClass();
        // We don't want to display the Microsoft Word window
        WordApp.Visible = false;
        // If the user chose a file to open
        if (this.openDoc.ShowDialog() == DialogResult.OK)
        {
            // Set the label to the new file path
            lblFilePath.Text = openDoc.FileName;
            // Enable the convert and close buttons, since now we have a document opened
            btnToHtml.Enabled = true;
            btnToRTF.Enabled = true;
            btnToText.Enabled = true;
            btnToXml.Enabled = true;
            btnClose.Enabled = true;
            // Create and set the objects we're going to pass to the Open() function
            object DocFileName = openDoc.FileName;
            object DocReadOnly = false;
            object DocVisible = true;
            // Open the document by passing the path. In the Word 11.0 library,
            // Open() takes 16 parameters and Visible is the 12th one.
            WordDoc = WordApp.Documents.Open(ref DocFileName, ref DocNoParam, ref DocReadOnly,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam,  // Encoding
                ref DocVisible,  // Visible
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam);
        }
    }
The above code opens the Word document specified by the user in the OpenFileDialog window, enables the convert and close buttons and sets the label to the path of the file just so that we remember which file is opened.
As we discussed before, we pass values to only a handful of the parameters of the Documents.Open method; to most of them we pass a reference to DocNoParam, which contains Type.Missing, meaning plain and simple that we don’t want to pass anything to that parameter. The Office COM object model was designed with the Visual Basic language in mind, which is why this line would be much shorter in Visual Basic: there we would only have to pass values to the parameters that we are interested in.
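For comparison, C# 4.0 and later narrow this gap for COM calls: optional parameters may be omitted, arguments may be named, and the ref keyword may be left out. The sketch below is not how the tutorial project is written; it assumes a C# 4.0 (or newer) compiler, the same WordApp and WordDoc fields declared earlier, and a hypothetical helper name, `OpenDocument`, placed in a second file of the partial Form1 class.

```csharp
namespace OpenWord
{
    public partial class Form1
    {
        // Hypothetical helper (not in the original project): opens a document
        // by naming only the parameters we care about. Every omitted optional
        // parameter is filled in by the compiler, so no DocNoParam is needed.
        private void OpenDocument(string path)
        {
            WordDoc = WordApp.Documents.Open(FileName: path, ReadOnly: false, Visible: true);
        }
    }
}
```

Naming the arguments also removes the need to count positions in the parameter list, which is easy to get wrong with sixteen parameters.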
Now that we have the Word document opened and can manipulate it as we want, let’s accomplish the main task of our program and save this document in different formats. The first button is supposed to save to HTML, so double-click it to get to the click event handler and use the following code:
    private void btnToHtml_Click(object sender, EventArgs e)
    {
        // Suggest a path for saving
        saveDoc.FileName = @"C:\Test Document.html";
        // The file extension to which we want to save
        saveDoc.Filter = "HTML Files|*.html";
        // If the user chose a path where to save the file
        if (this.saveDoc.ShowDialog() == DialogResult.OK)
        {
            // Set the save path object
            object SaveToPath = saveDoc.FileName;
            // Set the format type to HTML (wdFormatHTML)
            object SaveToFormat = Word.WdSaveFormat.wdFormatHTML;
            // Save the document to the specified path and format
            WordDoc.SaveAs(ref SaveToPath, ref SaveToFormat,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam);
        }
    }
As you can see in the code above, when btnToHtml is clicked we prompt the user to save the document in the HTML format. The whole magic is in the line object SaveToFormat = Word.WdSaveFormat.wdFormatHTML; where we specify the format we wish to use. In this case we specify wdFormatHTML to save the file as an HTML document. Upon clicking this button, the document will be converted from its specific .doc format to HTML tags. Along with the HTML file, Word sometimes also creates a folder that holds the pictures referenced in the HTML document.
For the remaining three buttons the code gets repetitive, with only a few changes to account for the different format and extension.
The C# code for converting to RTF:
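    private void btnToRTF_Click(object sender, EventArgs e)
    {
        // Suggest a path for saving
        saveDoc.FileName = @"C:\Test Document.rtf";
        // The file extension to which we want to save
        saveDoc.Filter = "RTF Files|*.rtf";
        // If the user chose a path where to save the file
        if (this.saveDoc.ShowDialog() == DialogResult.OK)
        {
            // Set the save path object
            object SaveToPath = saveDoc.FileName;
            // Set the format type to RTF (wdFormatRTF)
            object SaveToFormat = Word.WdSaveFormat.wdFormatRTF;
            // Save the document to the specified path and format
            WordDoc.SaveAs(ref SaveToPath, ref SaveToFormat,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam);
        }
    }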
The C# code for converting to plain text:
    private void btnToText_Click(object sender, EventArgs e)
    {
        // Suggest a path for saving
        saveDoc.FileName = @"C:\Test Document.txt";
        // The file extension to which we want to save
        saveDoc.Filter = "Text Files|*.txt";
        // If the user chose a path where to save the file
        if (this.saveDoc.ShowDialog() == DialogResult.OK)
        {
            // Set the save path object
            object SaveToPath = saveDoc.FileName;
            // Set the format type to TXT (wdFormatText)
            object SaveToFormat = Word.WdSaveFormat.wdFormatText;
            // Save the document to the specified path and format
            WordDoc.SaveAs(ref SaveToPath, ref SaveToFormat,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam);
        }
    }
The C# code for converting to XML:
    private void btnToXml_Click(object sender, EventArgs e)
    {
        // Suggest a path for saving
        saveDoc.FileName = @"C:\Test Document.xml";
        // The file extension to which we want to save
        saveDoc.Filter = "XML Files|*.xml";
        // If the user chose a path where to save the file
        if (this.saveDoc.ShowDialog() == DialogResult.OK)
        {
            // Set the save path object
            object SaveToPath = saveDoc.FileName;
            // Set the format type to XML (wdFormatXML)
            object SaveToFormat = Word.WdSaveFormat.wdFormatXML;
            // Save the document to the specified path and format
            WordDoc.SaveAs(ref SaveToPath, ref SaveToFormat,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam);
        }
    }
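Since the four save handlers differ only in the suggested path, the filter and the format constant, the repetition could be factored into one method. The following is a sketch of that idea, not the tutorial's own code: `SaveDocumentAs` is a hypothetical name, and the method relies on the saveDoc, WordDoc and DocNoParam members described earlier, placed in a second file of the partial Form1 class.

```csharp
using System.Windows.Forms;

namespace OpenWord
{
    public partial class Form1
    {
        // Hypothetical helper (not in the original project): prompts for a
        // path and saves the open document in the requested format.
        private void SaveDocumentAs(string suggestedPath, string filter, Word.WdSaveFormat format)
        {
            // Suggest a path and restrict the dialog to the target extension
            saveDoc.FileName = suggestedPath;
            saveDoc.Filter = filter;

            // Do nothing if the user cancels the dialog
            if (saveDoc.ShowDialog() != DialogResult.OK)
            {
                return;
            }

            // SaveAs takes its arguments by reference, so box them as objects
            object SaveToPath = saveDoc.FileName;
            object SaveToFormat = format;

            // 16 parameters in the Word 11.0 library; all but the first two are skipped
            WordDoc.SaveAs(ref SaveToPath, ref SaveToFormat,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                ref DocNoParam, ref DocNoParam);
        }
    }
}
```

With such a helper, a handler like btnToHtml_Click would reduce to a single call: `SaveDocumentAs(@"C:\Test Document.html", "HTML Files|*.html", Word.WdSaveFormat.wdFormatHTML);`.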
There’s one last thing we need to do. Unless we close each document after we open it, instances of WinWord.exe will remain in memory, so obviously you’ll want to press the close button before opening another document or closing the application. In the click event handler of btnClose we tell Word to close the document and to not save any changes:
    private void btnClose_Click(object sender, EventArgs e)
    {
        // Since we don't want to save changes to the original document
        object SaveChanges = false;
        // Close the document, save no changes
        WordDoc.Close(ref SaveChanges, ref DocNoParam, ref DocNoParam);
    }
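Closing the document does not by itself end the Word process that btnOpen started; the application object also has to be told to quit. A more thorough cleanup might look like the sketch below. It is an assumption layered on top of the tutorial rather than part of the original project: `ShutDownWord` is a hypothetical name, and the casts to Word._Document and Word._Application are there only to select the Close and Quit methods unambiguously, since the same names also exist as events.

```csharp
using System.Runtime.InteropServices;

namespace OpenWord
{
    public partial class Form1
    {
        // Hypothetical helper (not in the original project): closes the
        // document without saving, quits Word and releases both COM objects.
        private void ShutDownWord()
        {
            object SaveChanges = false;

            if (WordDoc != null)
            {
                // Close the document, discarding any changes
                ((Word._Document)WordDoc).Close(ref SaveChanges, ref DocNoParam, ref DocNoParam);
                Marshal.ReleaseComObject(WordDoc);
                WordDoc = null;
            }

            if (WordApp != null)
            {
                // Quit the hidden Word instance so WinWord.exe can exit
                ((Word._Application)WordApp).Quit(ref SaveChanges, ref DocNoParam, ref DocNoParam);
                Marshal.ReleaseComObject(WordApp);
                WordApp = null;
            }
        }
    }
}
```

Such a method could be called from btnClose_Click and from the form's FormClosing handler, so that no hidden Word instance is left behind when the application exits.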
Here is the entire application code in case you want to have an overall look:
    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Drawing;
    using System.Text;
    using System.Windows.Forms;

    namespace OpenWord
    {
        public partial class Form1 : Form
        {
            private Word.ApplicationClass WordApp;
            private Word.Document WordDoc;
            private object DocNoParam = Type.Missing;

            public Form1()
            {
                InitializeComponent();
            }

            private void btnOpen_Click(object sender, EventArgs e)
            {
                // Create an instance of the Word Application
                WordApp = new Word.ApplicationClass();
                // We don't want to display the Microsoft Word window
                WordApp.Visible = false;
                // If the user chose a file to open
                if (this.openDoc.ShowDialog() == DialogResult.OK)
                {
                    // Set the label to the new file path
                    lblFilePath.Text = openDoc.FileName;
                    // Enable the convert and close buttons, since now we have a document opened
                    btnToHtml.Enabled = true;
                    btnToRTF.Enabled = true;
                    btnToText.Enabled = true;
                    btnToXml.Enabled = true;
                    btnClose.Enabled = true;
                    // Create and set the objects we're going to pass to the Open() function
                    object DocFileName = openDoc.FileName;
                    object DocReadOnly = false;
                    object DocVisible = true;
                    // Open the document by passing the path. In the Word 11.0 library,
                    // Open() takes 16 parameters and Visible is the 12th one.
                    WordDoc = WordApp.Documents.Open(ref DocFileName, ref DocNoParam, ref DocReadOnly,
                        ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                        ref DocNoParam, ref DocNoParam, ref DocNoParam,
                        ref DocNoParam,  // Encoding
                        ref DocVisible,  // Visible
                        ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam);
                }
            }

            private void btnToHtml_Click(object sender, EventArgs e)
            {
                // Suggest a path for saving
                saveDoc.FileName = @"C:\Test Document.html";
                // The file extension to which we want to save
                saveDoc.Filter = "HTML Files|*.html";
                // If the user chose a path where to save the file
                if (this.saveDoc.ShowDialog() == DialogResult.OK)
                {
                    // Set the save path object
                    object SaveToPath = saveDoc.FileName;
                    // Set the format type to HTML (wdFormatHTML)
                    object SaveToFormat = Word.WdSaveFormat.wdFormatHTML;
                    // Save the document to the specified path and format
                    WordDoc.SaveAs(ref SaveToPath, ref SaveToFormat,
                        ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                        ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                        ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                        ref DocNoParam, ref DocNoParam);
                }
            }

            private void btnToRTF_Click(object sender, EventArgs e)
            {
                // Suggest a path for saving
                saveDoc.FileName = @"C:\Test Document.rtf";
                // The file extension to which we want to save
                saveDoc.Filter = "RTF Files|*.rtf";
                // If the user chose a path where to save the file
                if (this.saveDoc.ShowDialog() == DialogResult.OK)
                {
                    // Set the save path object
                    object SaveToPath = saveDoc.FileName;
                    // Set the format type to RTF (wdFormatRTF)
                    object SaveToFormat = Word.WdSaveFormat.wdFormatRTF;
                    // Save the document to the specified path and format
                    WordDoc.SaveAs(ref SaveToPath, ref SaveToFormat,
                        ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                        ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                        ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                        ref DocNoParam, ref DocNoParam);
                }
            }

            private void btnToText_Click(object sender, EventArgs e)
            {
                // Suggest a path for saving
                saveDoc.FileName = @"C:\Test Document.txt";
                // The file extension to which we want to save
                saveDoc.Filter = "Text Files|*.txt";
                // If the user chose a path where to save the file
                if (this.saveDoc.ShowDialog() == DialogResult.OK)
                {
                    // Set the save path object
                    object SaveToPath = saveDoc.FileName;
                    // Set the format type to TXT (wdFormatText)
                    object SaveToFormat = Word.WdSaveFormat.wdFormatText;
                    // Save the document to the specified path and format
                    WordDoc.SaveAs(ref SaveToPath, ref SaveToFormat,
                        ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                        ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                        ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                        ref DocNoParam, ref DocNoParam);
                }
            }

            private void btnToXml_Click(object sender, EventArgs e)
            {
                // Suggest a path for saving
                saveDoc.FileName = @"C:\Test Document.xml";
                // The file extension to which we want to save
                saveDoc.Filter = "XML Files|*.xml";
                // If the user chose a path where to save the file
                if (this.saveDoc.ShowDialog() == DialogResult.OK)
                {
                    // Set the save path object
                    object SaveToPath = saveDoc.FileName;
                    // Set the format type to XML (wdFormatXML)
                    object SaveToFormat = Word.WdSaveFormat.wdFormatXML;
                    // Save the document to the specified path and format
                    WordDoc.SaveAs(ref SaveToPath, ref SaveToFormat,
                        ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                        ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                        ref DocNoParam, ref DocNoParam, ref DocNoParam, ref DocNoParam,
                        ref DocNoParam, ref DocNoParam);
                }
            }

            private void btnClose_Click(object sender, EventArgs e)
            {
                // Since we don't want to save changes to the original document
                object SaveChanges = false;
                // Close the document, save no changes
                WordDoc.Close(ref SaveChanges, ref DocNoParam, ref DocNoParam);
            }
        }
    }